PyRPM Undercovered (Developer notes)
====================================


Overview
--------

In this readme we want to provide some more technical information about RPM
and PyRPM itself and how everything works. Some of the collected information
here can be found in other documents spread out over the Internet in more
detail. See the related projects section at the end of this file.


Developer Notes
---------------

Use `pychecker` or `pylint` to improve code quality, 4 space indents and
no tabs. Most scripts support giving `\--hotshot` as first option to run
them with the hotshot python profiler.
`oldpyrpm.py` has some TODO items listed in the source.


Is Python the right language?
-----------------------------

Python is a very good scripting language. Very nice for fast prototyping
and implementing new features. We currently see limits if too much data
needs to be processed. This is happening with parsing xml data as well
as with the huge amount of dependency data in rpm packages (where all
files in rpm packages can also be used for dependencies, those are
about 350000 files in FC-development).

These python modules could see some improvements:

 - python-elementtree is about 3 times faster in reading in the comps.xml
   file compared to libxml2. elementtree still does some internal automatic
   detection and conversion between data encodings that looks like it might
   add overhead.
   Migrating to python-elementtree should either be done for the complete
   code or not at all. Since libxml2 is the default xml library, we'll
   stick with it for now.
   XML processing is one of the biggest CPU consumers right now.
 - gzip decompression should offer a filetype object with .read().
   The current hacks in the class PyGZIP should go away.
 - bsddb access seems very slow (reading rpmdb)
 - bzip2 decompression does not have streaming support


RPM basic principles
--------------------

At first we want to provide some high level information about how rpm works
without going into too much techincal detail. This should help understand how
rpm works and what problems it solves (and which it doesn't and why yum is
needed). Following that will be a section that describes then what yum does
and what problems it solves (and again, which it can't solve and which we try
to solve in our yum derivate). Afterwards we will go more into the rpm binary
format and the format of yum repositories as well as several interessting
points about the way rpm works and the problems that appear with doing to.


What are dependencies?
----------------------

Rpms whole concept is based on socalled dependencies which define the relation
between packages. Those are simply expressed using Requires and Provides. In
mathematical terms this can be viewed as a directed graph where the nodes are
packages and the edges are requirements from packages that require something to
packages that provide that requirement.
To make this a little more visible here a small ascii art:

 A ----> B ----> C

In this example:

A requires B

 and

B requires C

So in order to be able to install A we need to install B and C as well, as
B needs C.

Requires and provides can be versioned and have additional flags like <, <=,
==, >=, > which need to be handled properly.


How are versions compared? (by Tom "Spot" Callaway and Paul Nasrat)
-------------------------------------------------------------------

The rpmvercmp algorithm compares two labels (like the Version or the Release
tag) to see which is newer.

In this algorithm, "digits" and "letters" are defined as ASCII digits
('0'-'9') and ASCII letters ('a'-'z' and 'A'-'Z'). Other Unicode digits and
letters (like accented Latin letters) are not considered letters. ASCII
letters and digits are called "alphanumeric" characters.

Please note that the algorithm's actions is undefined in some cases, in a ways
may make the resulting comparisons stop working sanely (see [WWW]
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=178798 for an example
where the order of the comparison is more important than the operands). To
avoid these, make sure that all your labels start and end with alphanumeric
characters. So while things like "1.", "+a", or "_" are allowed as labels, the
result of such comparisons are undefined. For the exact (non-symmetric)
algorithm, see lib/vercmp.c in the RPM source code. The following algorithm is
a simplification based on the version available in FC4, and is considered to
be stable, as the last time it changed in any way was in January 2003.

 1. Each label is separated into a list of maximal alphabetic or numeric
sections, with separators (non-alphanumeric characters) ignored. If there is
any extra non-alphanumeric character at the end, that. So, '2.0.1' becomes
('2', '0', '1'), while ('2xFg33.+f.5') becomes ('2', 'xFg', '33', 'f', '5').

 2. All numbers are converted to their numeric value. So '10' becomes 10,
'000230' becomes 230, and '00000' becomes 0.

 3. The elements in the list are compared one by one using the following
algorithm. If two elements are decided to be different, the label with the
newer element wins as the newer label. If the elements are decided to be
equal, the next elements are compared until we either reach different elements
or one of the lists runs out. In case one of the lists run out, the other
label wins as the newer label. So, for example, (1, 2) is newer than (1, 1),
and (1, 2, 0) is newer than (1, 2).

The algorithm for comparing list elements is as follows:

 1. If one of the elements is a number, while the other is alphabetic, the
numeric elements is considered newer. So 10 is newer than 'abc', and 0 is
newer than 'Z'.

 2. If both the elements are numbers, the larger number is considered newer.
So 5 is newer than 4 and 10 is newer than 2. If the numbers are equal, the
elements are decided equal.

 3. If both the elements are alphabetic, they are compared using the Unix
strcmp function, with the greater string resulting in a newer element. So 'b'
is newer than 'a', 'add' is newer than 'ZULU' (because lowercase characters
win in strcmp comparisons), and 'aba' is newer than 'ab'. If the strings are
identical, the elements are decided equal.

Examples

Some random examples, to make sure you understand the rpmvercmp algorithm:

   1.  '1.0010' is newer than '1.9' because 10 is more than 9.

   2.  '1.05' is equal to '1.5', because both '05' and '5' are treated as the
number 5.

   3.  '1.0' is newer than '1', because it has one more element in the list,
while previous elements are equal.

   4.  '2.50' is newer than '2.5', because 50 is more than 5.

   5.  'fc4' is equal to 'fc.4', because the alphabetic and numeric sections
will always get separated into different elements anyway.

   6.  'FC5' is older than 'fc4', because it uses uppercase letters.

   7.  '2a' is older than '2.0', because numbers are considered newer than
letters.

   8.  '1.0' is newer than '1.fc4' because numbers are considered newer than
letters.

   9.  '3.0.0_fc' is the same as '3.0.0.fc', because the separators themselves
are not important.


What are obsoletes?
-------------------

Sometimes it is necessary to replace an installed package with a new package
that provides the same functionality as an installed one. By obsoleting that
installed package in the new package the installed package will be removed
during an update and replaced with the new package. Obsoletes can be versioned
and have flags just like requires and provides.

An example was when ucd-snmp got replaced by net-snmp. There the new net-snmp
package contained a "Obsoletes: ucd-snmp" and, as it provided the
functionality of the old ucd-snmp a "Provides: ucd-snmp".


What are conflicts?
-------------------

In rpm you can specify that your package conflicts with something another
package provides. This can be used so that e.g. packages that are know not
to work with one another can refuse to be installed if the other is already
there. Conflicts are tested against the provides, just like requires. But
instead of fullfilling a requirement they produce an error if there is a match.
Conflicts can be versioned and have flags just like requires and provides.


What are fileconflicts?
-----------------------

If you look at a set of rpm packages, then a fileconflict is any duplicate
filename where two files differ in their filemode (file permissions as well
as file type like directory, regular file, etc), their filemd5sum (which is
only set for regular files), their fileusername or their filegroupname.
To support multilib/compatarch installations also all fileconflicts where
one file is a ELF32 and the other is a ELF64 binary are ignored.

Since fileconflicts do occur pretty frequently, the default setting is to
ignore them. yum within Fedora Core has fileconflict detection enabled.


How does dependency check work?
-------------------------------

Dependency checking is fairly simple: For a given set of packages take all
requirements and check for every requirement if any package provides this
requirement, possibly restricted by it's version and flags.


How does ordering work?
-----------------------

Given a set of packages to determine the order in which the packages need to
be installed (or removed) a dependency graph is constructed. All packages with
no requirements can be installed first. Afterwards all requires of the remaining
packages that were provided by the installed packges get removed. This normally
results in a new set of packages without any requirements. This process is then
repeated until either no more packages need to be installed or we have
requirement loops between packages. These loops then need to be broken up by
intelligently selecting one dependency and removing that until we have
packages without requirements again. If there are several packages in one round
without requirements the order of those are independant and can be installed
in any order.

A special case for the loop breaking are PreReqs. Those requirements are
special in that sense that they should never ever be broken up if possible as
they specify that a certain package absolutely has to be installed before the
other one can be safely and correctly installed. This happens often if a
package needs some specific binaries in one of it pre or post scripts.


How does dependency resolving work?
-----------------------------------

Now that we know how dependency checks and ordering works the next step is
dependency resolving. Here we need to leave the space of a single set of rpms
and include some kind of repository from which the depresolver can pull
packages. A typical scenario is where you have an installed system with a set
packages B, a set of packages to be installed or updated I and a set of
packages in one or more repositories called R.

The depresolver now takes all requirements from the packages in I and checks
which are still not resolved by B. For those unfulfilled requirements it then
looks at the packages in R and tries to find packages that fulfill those
requirments and add the best matching package to I. This process is repeated
until all requirments are fullfilled or aborted if they can't be resolved
using R.

An important fact is that dependencies are "arch-less", meaning if you have a
requirement on multilib systems this can lead to 64bit packages being pulled
it from 32bit packages and vice versa. For libraries this can and usally will
automatically be prevented as 64bit libraries on 64bit systems have a (64bit) 
at the end of their provides, so they specifically and correcltly will be
required by binaries linked against them.

More problematic on the other hand are development packages where there a
package foo-devel typically only has a "Requires: foo" as a requirement which
would pull in a 64bit foo package for a 32bit foo-devel package, which is
most of the time not what you want.


What is special about multilib systems?
---------------------------------------

Multilib systems (mixed 32 and 64bit systems like AMD64 or S390x) have some
special rules that apply to them as on those architectures both 32bit and 64bit
packages can be installed at the same time. Each file in a rpm has a color. The
color is 0 if the file is arch independant (text files, etc), 1 if it's a 32bit
binary and 2 if it is a 64bit binary. If the color is 0, normal file conflict
handling is done. If the color larger than 0 and both colors are equal, again,
normal file conflicts and handling is done. If both are larger than 0 and
differ then the "higher" color wins, meaning the 64bit binary. No file
conflicts will be done for that case either.


How do Updates work?
--------------------

There are 2 basic forms for updates: Specified or complete updates. For the
former this is easily done by looking for packages that match the ones given
on the command line. Those can either be names (with version, release and arch
information, of course), regular expressions or binary rpm packages. For the
later a special pre updates pass has to be done in order to correctly obsolete
old packages with newer ones (like the changes from xorg-x11 to modular-x from
FC3 to FC4). Other than that it's identical to the specified case where the
names to be updates are simply the names of all packages installed on the
system.

For a single arch system it is fairly easy to find the best update package: We
just look for the package with the "highest" arch (except if exactarch is
specified in which case the arch of the update package needs to exactly match
the one of the installed package) and then the highest version.

For multilib system things get a lot trickier. If the given update package is
either marked as "install only" (like usually kernel packages) or if no package
with that name has already been installed we use the simple algorithm as
before, selecting the highest arch/version match for the update. If exactarch
has been set then we just need to go through all installed archs of each name
and find the highest versions for each arch.

If exactarch isn't set we now need to determine if more than one major arch
for a name is installed. Major arch in this case means either 32bit or 64bit,
e.g i386, i486, i586, i686, athlon for 32bit and x86_64 for 64bit. If there is
only one we can and do allow even major arch changes (from 32bit to 64bit and
vice versa). If packages for more than one major arch are already installed we
need to find updates for each of the installed major archs and do the final
selection as in the previous cases.


How do pre/post install/uninstall scripts and triggers work?
------------------------------------------------------------

RPM provides the ability to have packages execute a set of scripts before and
after installation and uninstallation. This often is needed to either prepare
the system for the actual installation, do some post processing after a
package is installed, some pre processing before a package is uninstalled or
cleanup after a packages has been uninstalled.

Although the basic idea is fairly simple it still can cause some confusion
when scripts are executed for example during updates of packages. This is the
order how an update of a package is done in relation to the scripts:

 * Run %pre of new package
 * Install new files
 * Run %post of new package
 * Run %preun of old package
 * Delete any old files not overwritten by newer ones
 * Run %postun of old package

So during an update the new package gets installed first with all it's scripts
and then the old package with all it's scripts get uninstalled.

Sometimes it is necessary for packages to run scripts when other packages are
installed or uninstalled. This can be done using trigger scripts. Triggers are
therefore very similar to requirements and are written the same way in the
spec files. The order of how and when trigger scripts are called looks like
this:

 * Run %pre for new version of package being installed
 * Install new files
 * Run %post for new version of package being installed
 * Run %triggerin from other packages set off by new install
 * Run %triggerin of new package
 * Run %triggerun of old package
 * Run %triggerun from other packages set off by old uninstall
 * Run %preun for old version of package being removed
 * Delete any old files not overwritten by newer ones
 * Run %postun for old version of package being removed
 * Run %triggerpostun of old package
 * Run %triggerpostun from other packages set off by old uninstall

Looks a little more complicated, but allows to perform postprocessing by other
packages after some package has been installed or uninstalled. Take notice
that there isn't a %triggerpostin and that %triggerin basically behaves like a
%triggerpostin.


How does a binary rpm look like?
--------------------------------

For RPM there are nowadays several "formats" in which you can find information
about rpm packages. The most typical one is of course the binary rpm header
which is part of every binart rpm package. A typical binary rpm package looks
like this:

-------------------------------------------
+------+-----------+--------+-------------+
| Lead | Signature | Header | Gziped CPIO |
+------+-----------+--------+-------------+
-------------------------------------------

The lead has a fixed size of 96 bytes and contains some very basic information
about the binary rpm. It can also generally be used to determine if a file is
a binary rpm or not (using file e.g.) as it contains some very specific to
easily identify them.

The signature and the header are stored as rpm header structures. Rpm header
structures look like this:

-------------------------------------------------------
+-------+---------+-----------+-----------+-----------+
| Magic | IndexNr | StoreSize | Indexdata | Storedata |
+-------+---------+-----------+-----------+-----------+
-------------------------------------------------------

The Magic is a hardcoded value, IndexNr the number of index entries and
StoreSize the size in bytes of the store data.

Indexdata consists of IndexNr index entries each of which is 16 bytes. Each
index entry looks like this:

-------------------------------
+-----+------+--------+-------+
| Tag | Type | Offset | Count |
+-----+------+--------+-------+
-------------------------------

Tag specifies which tag this entry is about. Type specifies the type of the
tage. Offset specifies at which offset in the Storedata the data begins for
this tag. Count has various size meanings depending on the type.

Storedata finally contains the real tag information. As mentioned in the
previous paragraph by using an index entry from the Indexdata you can find
and parse all data relevant to a specifc tag. The format depends of course
on the type of the tag.

More detailed information about the binary rpm format can be found here:
link:http://www.rpm.org/max-rpm/s1-rpm-file-format-rpm-file-format.html[]

The rpm binary format can be partially found in the rpmdb as well. The file
/var/lib/rpm/Packages contains the complete headers of the orignal binary
rpms in a rpm header structure format without the 8 byte magic and with
some additional installation revelvant indexes appended.

Another nowadays common format for reduced rpm header data is the repo metadata
format used by yum. It is a split up and reduced version of the orignal
rpm header information using XML. It is mainly useful to determine and resolve
dependencies of rpm packages. More information about the metadata can be found
here:

link:http://linux.duke.edu/projects/metadata/[]

Other less common storage formats include databases like SQLite or MySQL which
e.g yum uses to convert the repodata format to a more usable form locally.

Apart from that rpm itself extracts quite a bit of the information from rpm
binary headers and writes them in various db4 files in /var/lib/rpm.


RPM database internals
----------------------

This section describes the structure from the various files in
/var/lib/rpm. All files are db4 files, either hash or btree based. With the
exception of Packages all files have the corresponding rpmtag based value as
key. The data consists of integer pairs which contain the package id and
the index at which this entry can be found in the rpm header of that tag. The
values are 4 byte integers in host byte order. For some tags the index doesn't
make any sense. In those cases the index value will always be set to 0.


Filelist
~~~~~~~~

Basenames (hash)::
 * key: Basename (string)
 * values: list of 2-tuples: installid (4 byte int), basenameindex (4 byte int)

Conflictname (hash)::
 * key: Conflictname (string)
 * values: list of 2-tuples: installid (4 byte int), conflictindex (4 byte int)

Dirnames (btree)::
 * key: Dirname (string)
 * values: list of 2-tuples: installid (4 byte int), dirindex (4 byte int)

Filemd5s (hash)::
 * key: md5sum (4 * 4 byte int, no hex string!)
 * values: list of 2-tuples: installid (4 byte int), filemd5sindex (4 byte int)

 Only stored if file md5sum exists and if the file is a regular file (usually
 equivalent)

Group (hash)::
 * key: Groupname (string)
 * values: list of 2-tuples: installid (4 byte int), index (4 byte int) (always 0)

Installtid (btree)::
 * key: Installtime of transaction (4 byte int, time() value)
 * values: list of 2-tuples: installid (4 byte int), index (4 byte int) (always 0)

Name (hash)::
 * key: Packagename (string)
 * values: list of 2-tuples: installid (4 byte int), index (4 byte int) (always 0)

Packages (hash)::
 * key: Installid (4 byte int)
 * values: Complete binary rpm header with some  additional information from
           signature without lead.

Providename (hash)::
 * key: Providename (string)
 * values: list of 2-tuples: installid (4 byte int), providenameindex (4 byte int)

Provideversion (btree)::
 * key: Provideversion (string)
 * values: list of 2-tuples: installid (4 byte int), provideversionindex (4 byte int)

Pubkeys (hash)::
 * key: unknown yet
 * values: unknown yet

Requirename (hash)::
 * key: Requirename (string)
 * values: list of 2-tuples: installid (4 byte int), requirenameindex (4 byte int)

 Only contains the requirenames of not install prereqs

Requireversion (btree)::
 * key: Requireversion (string)
 * values: list of 2-tuples: installid (4 byte int), requireversionindex (4 byte int)

Sha1header (hash)::
 * key: Sha1header (string) (just as the value from the header)
 * values: list of 2-tuples: installid (4 byte int), index (4 byte int) (always 0)

Sigmd5 (hash)::
 * key: md5sum from header (4 * 4 byte int)
 * values: list of 2-tuples: installid (4 byte int), index (4 byte int) (always 0)

Triggername (hash)::
 * key: Triggername (string)
 * values: list of 2-tuples: installid (4 byte int), triggerindex (4 byte int)

 Only contains the first entry for each name from a package


Example
~~~~~~~

Now an example of the connection between the package headers which are stored
in Packages and the rest of the files.

The connection between /var/lib/rpm/Packages and the other files looks like
this:

/var/lib/rpm/Packages:
^^^^^^^^^^^^^^^^^^^^^^
'----------'-----------'-----
 Package id Requirename Index
-----------------------------
 5          a           0
            b           1
 8          c           0
            a           1
            b           2
-----------------------------


/var/lib/rpm/Requirename:
^^^^^^^^^^^^^^^^^^^^^^^^^
'-----------'----------'-----
 Requirename Package Id Index
-----------------------------
 a           5          0
             8          1
 b           5          1
             8          2
 c           8          0
-----------------------------

That means the complete /var/lib/rpm files can be cross checked with
/var/lib/rpm/Packages and can be regenerated from that file as well.

An exception is Installtid. This db file contains as keys the TID which is a
unique time in seconds since 1970 that reflects a complete transaction. Every
header in Packages contains that TID as "installtid" tag. The values of the
Installtid db file are again pairs of integers with a package id as first
value and the second value always 0. Here a small example:


/var/lib/rpm/Packages:
^^^^^^^^^^^^^^^^^^^^^^
'----------'-----------
 Package id Install Tid
-----------------------
 5          1000000
 8          1000000
 6          1234567
 9          1234567
 7          2345678
-----------------------

/var/lib/rpm/Installtid:
^^^^^^^^^^^^^^^^^^^^^^^^
'-----------'----------'-----
 Install Tid Package ID Index
-----------------------------
 1000000     5          0
             8          0
 1234567     6          0
             9          0
 2345678     7          0
-----------------------------

As you can see it can happen that package ID's get reused, in our example 6.
This can happen if a package gets deleted and the ID "dropped". So there is
unfortunately no autoincrementing ID for the packages.


Huge Dependency Data
--------------------

The data eating up RAM in rpm headers are descriptions, changelogs and
filelists.

The dependency data we operate with is extremely huge. In addition to the
`Provides:` data which contains shared libs, rpm versions and explicitely
listed ones in .spec files, dependency data can also use any filerequires
like e.g. `Requires: /usr/bin/foo` to reference any file in any other rpm
package. That means we potentially have to look at a filelist of all rpm
packages. That data is extremely huge as the current Fedora Core
development tree contains more than 350000 files.

As the dependency data is worked with on each client to update the machine,
it must be a goal to reduce this data to a smaller subset.

The current repo metadata has a fixed file regex of
`^(.\*bin/.\*|/etc/.\*|/usr/lib/sendmail)$` and a directory regex of
`^(.\*bin/.\*|/etc/.\*)$`. That regex specifies the data given in the
`repodata/primary.xml.gz` file and you have to fallback to the complete
filelists available in `repodata/filelists.xml.gz` if any dependency
request is done outside of that data. (The regex gives a deterministic way
to know when to load the full filelist.) The regex used to be pretty
complete for Fedora Core in the past, but additional filerequires are
present in newer Fedora Core and Fedora Extra rpm packages which require
a reload of the complete lists.

In addition to the completeness problems above, it was also noted that
the regex lists contain 100 times more data than actually being used in
current repositories. Conary is thus maintaining explicit lists of
possible file requires. Maybe new ways to add autogenerated, small filelists
can be worked out that would work for most comon usage cases, also with
the fallback to the complete lists like yum / createrepo implement right
now.


Storing Complete Dep Graphs
---------------------------

It would also be possible to store dependency graphs that contain data for
the resolver to select the right rpm packages plus the orderer to specify
the right sequence to install them. But many machines do have further
packages installed outside of that package set, so this would then mostly
be used for new installs. Optimizing the general update path for running
machines should be more important than improving the install path for
new installs, so this is currently no goal, but would very well be possible
todo.


Notes about the Repo-Metadata
-----------------------------

The following things should be noted about the repo metadata. yum is using
the repodata only within the resolver part to determine a set of rpms that
should be updated and/or installed. Then the complete rpm headers are
downloaded and another dependency check from librpm is run in addition to
determining the ordering of rpm packages.

Here a few limitations you should be aware of if you want to work with the
repodata for more than the resolver or understand the limits of the
resolver:

 - Repodata has evolved over time. Until now no version information has been
   added to the created data, this might make sense for future changes.
 - Even if no epoch is specified in the rpm header, the metadata will
   specify this as "0". That's the correct way for version and dependency
   checks.
 - Dependency information is often specified like `bash >= 3.0` and consists
   of a (name, flag, version) triple. The flag part is specified as integer
   within the rpm header and is only partially copied over into the repodata.
   Installation ordering of rpm packages is not possible with the current
   available data (or only based on reduced data). Future repodata could make
   the data more complete or just copy the integer into the output to provide
   it as exact copy.
   (Repo data adds a "pre" flag if the RPMSENSE_PREREQ flag is set. That
   information is actually not complete to identify install prereq versus
   an erase prereq.)
 - The `primary.xml.gz` file contains a subset of the included files. Because
   of thise some operations cannot be equivalently with binary rpms or
   repodata headers.

