[Libosinfo] RFC: Splitting off database into separate package
Daniel P. Berrange
berrange at redhat.com
Wed Jul 22 10:46:23 UTC 2015
Currently we distribute the database alongside the library code. We have
long known this was not ideal over the long term, since the database is
updated much more frequently than the code needs to be, and people also
wish to be able to consume database updates without updating the code.
As such, I think it is time to look at splitting the database off into
a separate package and git repository from the code, which we can then
distribute and release on separate timelines. In fact for the database
we'd probably just want to make available daily automated tar.gzs from
git, rather than doing manual releases.
In the new Git repository I think we'd need to have the following pieces
of the current codebase:
- data/
- tests/isodata/
- tests/test-isodetect.c
When doing an automated release, we'd want to run some tests to make
sure the database is syntactically valid & can be successfully loaded,
as well as running the existing ISO unit tests.
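A minimal sketch of the "syntactically valid" part of such a check, assuming the database stays as XML files under a directory such as data/. The function name is made up for illustration, and this only verifies well-formedness, not conformance to any schema:

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_broken_xml(data_dir):
    """Return (path, error) pairs for every XML file that fails to parse.

    Hypothetical helper: it only checks that each file is well-formed XML,
    which is a weaker test than validating against the database schema.
    """
    broken = []
    for path in sorted(Path(data_dir).rglob("*.xml")):
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            broken.append((str(path), str(err)))
    return broken
```

An automated release job could call find_broken_xml("data") and refuse to publish the tar.gz if the returned list is not empty.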
I think we also need to consider what we need to do to future proof
ourselves, because once we start distributing the database separately
from the code, we are making a much stronger commitment to supporting
the current database format long term. From that POV, I think we need
to consider a few things:
- Is XML the format we want to use long term ?
We already ditched XML for the PCI & USB ID databases, in favour of
directly loading the native data sources because XML was just too
damn slow to parse. I'm concerned that as we increase the size of
the database we might find this becoming a more general problem.
So should we do experiments to see if something like JSON or YAML
is faster to load data from ?
If we want to use a different format, should we do it exclusively
or in parallel ?
eg should we drop XML support if we switch to JSON, or should
we keep XML support and automatically generate a JSON version
of the database ?
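One way such an experiment could be sketched, purely as an illustration: generate the same synthetic records in XML and JSON and time how long each takes to load. The record fields and function names here are invented, and the numbers would reflect Python's parsers rather than the XML parser the C library uses, so treat any result as a rough indication only:

```python
import json
import timeit
import xml.etree.ElementTree as ET


def make_records(count):
    """Build synthetic OS-like records; the field names are illustrative only."""
    return [{"id": f"http://example.org/os/{i}",
             "name": f"Example OS {i}",
             "version": str(i)} for i in range(count)]


def to_xml(records):
    """Serialize the records as a small XML document."""
    root = ET.Element("libosinfo")
    for rec in records:
        os_el = ET.SubElement(root, "os", id=rec["id"])
        ET.SubElement(os_el, "name").text = rec["name"]
        ET.SubElement(os_el, "version").text = rec["version"]
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Load the records back out of the XML document."""
    root = ET.fromstring(text)
    return [{"id": el.get("id"),
             "name": el.findtext("name"),
             "version": el.findtext("version")} for el in root.iter("os")]


def compare_load_times(count=1000, repeat=5):
    """Return the best observed load time in seconds for each format."""
    records = make_records(count)
    xml_text = to_xml(records)
    json_text = json.dumps(records)
    # Both formats must decode to the same data for the comparison to be fair.
    assert from_xml(xml_text) == json.loads(json_text)
    return {
        "xml": min(timeit.repeat(lambda: from_xml(xml_text),
                                 number=1, repeat=repeat)),
        "json": min(timeit.repeat(lambda: json.loads(json_text),
                                  number=1, repeat=repeat)),
    }
```

Calling compare_load_times() with a record count close to the real database size would give a first data point before committing to any format change.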
- Do we need to mark a database schema version in some manner ?
eg any time we add new attributes or elements to the schema we
should increment the version number. That would allow the library
to detect what version of the data it is receiving. Even though old
libraries should be fine accepting new database versions, and
new libraries should be fine accepting old database versions,
experience has taught me we should have some kind of versioning
information as a "get out of jail free card".
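To make the idea concrete, here is a hedged sketch of how a reader might use such a marker. It assumes a "major.minor" value stored in a version attribute on the root element, and a policy where only a major bump signals an incompatible change; both the attribute name and the policy are assumptions for illustration, not something the current format defines:

```python
import xml.etree.ElementTree as ET

# Hypothetical: the newest schema generation this reader understands.
SUPPORTED_MAJOR = 1


def parse_schema_version(xml_text, attr="version"):
    """Return (major, minor) from the root element's version attribute.

    A missing attribute is treated as "0.0", ie the oldest schema.
    """
    root = ET.fromstring(xml_text)
    raw = root.get(attr, "0.0")
    # Pad so that a bare "1" is read as "1.0"; ignore any third component.
    parts = (raw.split(".") + ["0"])[:2]
    return int(parts[0]), int(parts[1])


def can_load(xml_text):
    """Accept any minor revision; refuse data from a newer major version."""
    major, _minor = parse_schema_version(xml_text)
    return major <= SUPPORTED_MAJOR
```

Under this policy, adding attributes or elements would bump only the minor number, so old and new libraries keep interoperating, while the major number remains available as the escape hatch.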
- Should we restructure the database ?
eg, we have a single data/oses/fedora.xml file that contains
the data for every Fedora release. This is already 200 KB in
size and will grow forever. If we split up all the files
so there is only ever one entity (os, hypervisor, device, etc)
in each XML file, each file will be smaller in size. This would
also let us potentially do database minimization. eg we could
provide a download that contains /all/ OS, and another download
that contains only non-end-of-life OS.
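A rough sketch of what such a split could look like, assuming each entity is a direct child of the root element and carries an id attribute. The function name and the file naming scheme (the id with non-alphanumeric runs turned into dashes) are invented for illustration:

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path


def split_entities(source_xml, out_dir, tag="os"):
    """Write each <tag> child of the source file into its own XML file.

    Hypothetical helper: every output file repeats the original root
    element (with its attributes) around a single entity.
    """
    root = ET.parse(str(source_xml)).getroot()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, entity in enumerate(root.findall(tag)):
        # Turn the entity id into a filesystem-safe name.
        safe = re.sub(r"[^A-Za-z0-9]+", "-", entity.get("id", "")).strip("-")
        target = out / f"{safe or 'entity-%d' % index}.xml"
        wrapper = ET.Element(root.tag, dict(root.attrib))
        wrapper.append(entity)
        ET.ElementTree(wrapper).write(str(target), encoding="utf-8",
                                      xml_declaration=True)
        written.append(str(target))
    return written
```

With one entity per file, building the "non-end-of-life only" download becomes a matter of choosing which files to include in the archive.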
- Should we formalize the specification so that we can officially
support other library implementations ?
While libosinfo is accessible from many languages via GObject
introspection, some projects are still loath to consume python
libraries backed by native code. eg openstack would really
prefer to be able to just pip install a pure python impl.
Currently the libosinfo library includes some implicit business
logic about how you load the database and deal with overrides
from different files, eg if you have the same OS ID defined in
multiple XML files, which one "wins". The same goes for which paths
are supposed to be considered when loading files and, in the future,
possibly how to download live updates over the net. It also has logic
about how you detect ISO images & install trees from the media data and
how to generate kickstart files, etc, none of which is formally
specified or documented.
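As an example of the kind of rule a formal specification would have to pin down, here is a sketch of one possible override policy: directories are searched in a given order and a later definition of the same ID replaces an earlier one. This "last one wins" rule and the function name are assumptions chosen for illustration, not a description of what the library does today:

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def load_with_overrides(search_dirs, tag="os"):
    """Map each entity id to the (path, element) that defines it.

    Assumed policy: directories are processed in the order given, files
    within a directory in sorted order, and a later definition of the
    same id replaces any earlier one.
    """
    winners = {}
    for directory in search_dirs:
        base = Path(directory)
        if not base.is_dir():
            continue  # a missing search path is simply skipped
        for path in sorted(base.rglob("*.xml")):
            for entity in ET.parse(str(path)).getroot().iter(tag):
                entity_id = entity.get("id")
                if entity_id:
                    winners[entity_id] = (str(path), entity)
    return winners
```

A pure python implementation and the C library would only agree on the loaded result if a rule like this, together with the list of search paths, were written down in the specification.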
- How do we provide notifications when updates are available ?
eg, we don't want 1000s of clients checking the libosinfo website
daily to download a new database if it hasn't changed since they
last checked. Can we efficiently provide info about database updates
so people can check and avoid downloading if it hasn't changed ? I
have thought about perhaps adding a DNS TXT record that records
the SHA256 checksum of the database, so clients can do a simple
DNS lookup to check for update availability. This is nice and scalable
thanks to DNS server caching & TTLs, avoiding hitting the webserver
most of the time.
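The client side of that check could be sketched roughly as follows. The DNS query itself is left to a caller-supplied function, since the Python standard library has no TXT lookup; the function names, and the assumption that the record holds just the hex digest, are hypothetical:

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def update_available(local_path, fetch_txt_record):
    """Compare the local database digest with the published one.

    fetch_txt_record is a hypothetical callable that performs the DNS TXT
    lookup and returns the record's text; it is assumed to contain only
    the hex digest, optionally wrapped in double quotes.
    """
    published = fetch_txt_record().strip().strip('"').lower()
    return published != sha256_of(local_path)
```

Only when update_available() returns True would the client go on to contact the webserver and download the new database.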
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|