[Libosinfo] RFC: Splitting off database into separate package

Wed Jul 22 10:46:23 UTC 2015

Currently we distribute the database alongside the library code. We have
long known this was not ideal over the long term, since the database is
updated much more frequently than the code needs to be, and people also
wish to be able to consume database updates without updating the code.

As such, I think it is time to look at splitting the database off into
a separate package and git repository from the code, which we can then
distribute and release on separate timelines. In fact for the database
we'd probably just want to make available daily automated tar.gzs from
git, rather than doing manual releases.

In the new git repository I think we'd need to have the following pieces
of the current codebase:

 - data/
 - tests/isodata/
 - tests/test-isodetect.c

When doing an automated release, we'd want to run some tests to make
sure the database is syntactically valid & can be successfully loaded,
as well as running the existing iso unit tests.
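
As a rough illustration of the "syntactically valid" part of such a check,
here is a minimal Python sketch. It only verifies that every XML file under
a directory is well-formed; it does not validate against the schema or
exercise the real libosinfo loader. The function name find_invalid_xml is
made up for this example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_invalid_xml(data_dir):
    """Return a list of (path, error) for XML files that fail to parse.

    Hypothetical helper: checks well-formedness only, not schema validity.
    """
    failures = []
    for path in sorted(Path(data_dir).rglob("*.xml")):
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            failures.append((str(path), str(err)))
    return failures
```

A release script could call find_invalid_xml("data") and refuse to publish
the tarball if the returned list is non-empty.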

I think we also need to consider what we need to do to future proof
ourselves, because once we start distributing the database separately
from the code, we are making a much stronger commitment to supporting
the current database format long term. From that POV, I think we need
to consider a few things:

 - Is XML the format we want to use long term ?

   We already ditched XML for the PCI & USB ID databases, in favour of
   directly loading the native data sources because XML was just too
   damn slow to parse. I'm concerned that as we increase the size of
   the database we might find this becoming a more general problem.

   So should we do experiments to see if something like JSON or YAML
   is faster to load data from ?

   If we want to use a different format, should we do it exclusively
   or in parallel ?

   eg should we drop XML support if we switch to JSON, or should
   we keep XML support and automatically generate a JSON version
   of the database ?

 - Do we need to mark a database schema version in some manner ?

   eg any time we add new attributes or elements to the schema we
   should increment the version number. That would allow the library
   to detect what version of the data it is receiving. Even though old
   libraries should be fine accepting new database versions, and
   new libraries should be fine accepting old database versions,
   experience has told me we should have some kind of versioning
   information as a "get out of jail free card"

 - Should we restructure the database ?

   eg, we have a single data/oses/fedora.xml file that contains
   the data for every Fedora release. This is already 200 KB in
   size and will grow forever. If we split up all the files
   so there is only ever one entity (os, hypervisor, device, etc)
   in each XML file, each file will be smaller in size. This would
   also let us potentially do database minimization. eg we could
   provide a download that contains /all/ OSes, and another download
   that contains only non-end-of-life OSes.

 - Should we formalize the specification so that we can officially
   support other library implementations ?

   While libosinfo is accessible from many languages via GObject
   introspection, some projects are still loath to consume python
   libraries backed by native code. eg openstack would really
   prefer to be able to just pip install a pure python impl.

   Currently the libosinfo library includes some implicit business
   logic about how you load the database and deal with overrides
   from different files, eg if you have the same OS ID defined in
   multiple XML files, which one "wins", and which paths are supposed
   to be considered when loading files. In the future it may also
   cover how to download live updates over the net. It also has logic
   about how you detect ISO images & install trees from the media data
   and how to generate kickstart files, etc, none of which is formally
   specified or documented.

 - How do we provide notifications when updates are available ?

   eg, we don't want 1000s of clients checking the libosinfo website
   daily to download a new database, if it hasn't changed since they
   last checked. Can we efficiently provide info about database updates
   so people can check and avoid downloading if it hasn't changed ? I
   have thought about perhaps adding a DNS TXT record that records
   the SHA256 checksum of the database, so clients can do a simple
   DNS lookup to check for update availability. This is nice and scalable
   thanks to DNS server caching & TTLs, avoiding hitting the webserver
   most of the time.
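
To make the DNS TXT idea a bit more concrete, here is a minimal Python
sketch of the client side. Everything specific in it is an assumption made
for illustration: the record name "_db.libosinfo.example" is invented, the
TXT record is assumed to hold just the hex SHA256 of the database tarball,
and the actual DNS query is left to a caller-supplied lookup_txt function,
since the Python standard library has no TXT lookup of its own.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def update_available(local_archive, lookup_txt,
                     record_name="_db.libosinfo.example"):
    """Return True if the published checksum differs from the local one.

    lookup_txt is a hypothetical callable taking a DNS name and returning
    the TXT record contents as a string. The record name is made up.
    """
    published = lookup_txt(record_name).strip().lower()
    return published != sha256_of_file(local_archive)
```

A client would only hit the webserver when update_available() returns
True; the rest of the time the answer comes from a cached DNS response.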

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|