[Libosinfo] RFC: Splitting off database into separate package
Daniel P. Berrange
berrange at redhat.com
Fri Jul 24 10:03:10 UTC 2015
Ping, anyone have any thoughts on these general ideas? I'm not
suggesting we do everything at once necessarily, but I'd like
us to at least figure out a direction to move forwards in...
On Wed, Jul 22, 2015 at 11:46:23AM +0100, Daniel P. Berrange wrote:
> Currently we distribute the database alongside the library code. We have
> long known this was not ideal over the long term, since the database is
> updated much more frequently than the code needs to be, and also people
> wish to be able to consume database updates without updating the code.
>
> As such, I think it is time to look at splitting the database off into
> a separate package and git repository from the code, which we can then
> distribute and release on separate timelines. In fact for the database
> we'd probably just want to make available daily automated tar.gzs from
> git, rather than doing manual releases.
>
> In the new GIT repository I think we'd need to have the following pieces
> of the current codebase
>
> - data/
> - tests/isodata/
> - tests/test-isodetect.c
>
> When doing an automated release, we'd want to run some tests to make
> sure the database is syntactically valid & can be successfully loaded,
> as well as run the existing ISO unit tests.
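>
> For instance, a rough Python sketch of the kind of check I have in
> mind (purely illustrative; the real test would presumably also load
> the data via libosinfo itself and run the ISO detection tests):
>
>     import sys
>     import xml.etree.ElementTree as ET
>     from pathlib import Path
>
>     # Refuse to cut a release tarball if any file under data/ is not
>     # even well-formed XML.
>     failures = 0
>     for path in Path("data").rglob("*.xml"):
>         try:
>             ET.parse(path)
>         except ET.ParseError as err:
>             print(f"{path}: {err}")
>             failures += 1
>     sys.exit(1 if failures else 0)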
>
> I think we also need to consider what we need to do to future-proof
> ourselves, because once we start distributing the database separately
> from the code, we are making a much stronger commitment to supporting
> the current database format long term. From that POV, I think we need
> to consider a few things
>
> - Is XML the format we want to use long term ?
>
> We already ditched XML for the PCI & USB ID databases, in favour of
> directly loading the native data sources because XML was just too
> damn slow to parse. I'm concerned that as we increase the size of
> the database we might find this becoming a more general problem.
>
> So should we do experiments to see if something like JSON or YAML
> is faster to load data from ?
>
> If we want to use a different format, should we do it exclusively
> or in parallel ?
>
> eg should we drop XML support if we switch to JSON, or should
> we keep XML support and automatically generate a JSON version
> of the database.
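>
> If we want numbers for that, a quick sketch of the experiment could
> look like this (assuming we mechanically generate a JSON copy of the
> same data; the file names here are invented):
>
>     import json
>     import timeit
>     import xml.etree.ElementTree as ET
>     from pathlib import Path
>
>     # Compare parse time of the same record set in both formats.
>     xml_time = timeit.timeit(
>         lambda: ET.parse("data/oses/fedora.xml"), number=100)
>     json_time = timeit.timeit(
>         lambda: json.loads(Path("data/oses/fedora.json").read_text()),
>         number=100)
>     print(f"XML: {xml_time:.3f}s  JSON: {json_time:.3f}s")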
>
> - Do we need to mark a database schema version in some manner ?
>
> eg any time we add a new attribute or element to the schema we
> should increment the version number. That would allow the library
> to detect what version of the data it is receiving. Even though old
> libraries should be fine accepting new database versions, and
> new libraries should be fine accepting old database versions,
> experience has told me we should have some kind of versioning
> information as a "get out of jail free card".
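>
> As a sketch of what I mean (the attribute name is invented), the root
> element could carry a schema version that the library checks on load:
>
>     import xml.etree.ElementTree as ET
>
>     # Hypothetical: <libosinfo schema-version="1.0"> ... </libosinfo>
>     SUPPORTED_MAJOR = 1
>
>     root = ET.parse("data/oses/fedora.xml").getroot()
>     major = int(root.get("schema-version", "1.0").split(".")[0])
>     if major > SUPPORTED_MAJOR:
>         # Additions should be backwards compatible, so just warn; the
>         # version is the "get out of jail free card" if that breaks.
>         print("warning: database schema is newer than this library")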
>
> - Should we restructure the database ?
>
> eg, we have a single data/oses/fedora.xml file that contains
> the data for every Fedora release. This is already 200KB in
> size and will grow forever. If we split up all the files
> so there is only ever one entity (os, hypervisor, device, etc)
> in each XML file, each file will be smaller. This would
> also let us potentially do database minimization, eg we could
> provide a download that contains /all/ OSes, and another download
> that contains only non-end-of-life OSes.
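>
> To illustrate the kind of mechanical split I mean, a rough sketch
> (assuming the existing <os>/<short-id> layout; the output directory
> and file naming are purely invented):
>
>     import xml.etree.ElementTree as ET
>     from pathlib import Path
>
>     # Break the monolithic fedora.xml into one file per <os> entity,
>     # named after its <short-id>.
>     tree = ET.parse("data/oses/fedora.xml")
>     root = tree.getroot()
>     outdir = Path("data/os/fedoraproject.org")
>     outdir.mkdir(parents=True, exist_ok=True)
>
>     for os_elem in root.findall("os"):
>         short_id = os_elem.findtext("short-id") or "unknown"
>         single = ET.Element(root.tag, root.attrib)
>         single.append(os_elem)
>         ET.ElementTree(single).write(outdir / f"{short_id}.xml",
>                                      encoding="utf-8",
>                                      xml_declaration=True)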
>
> - Should we formalize the specification so that we can officially
> support other library implementations ?
>
> While libosinfo is accessible from many languages via GObject
> introspection, some projects are still loath to consume Python
> libraries backed by native code, eg OpenStack would really
> prefer to be able to just pip install a pure Python impl.
>
> Currently the libosinfo library includes some implicit business
> logic about how you load the database and deal with overrides
> from different files, eg if you have the same OS ID defined in
> multiple XML files, which one "wins", and which paths are supposed
> to be considered when loading files. In the future this may also
> cover how to download live updates over the net. It also has logic
> about how you detect ISO images & install trees from the media data
> and how to generate kickstart files, etc, none of which is formally
> specified or documented.
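>
> To give a flavour of the rules a spec would need to pin down, here is
> a rough sketch of one possible policy - the search paths and the
> "last file wins" ordering are illustrative, not necessarily what
> libosinfo does today:
>
>     import xml.etree.ElementTree as ET
>     from pathlib import Path
>
>     # Illustrative search order: later paths override earlier ones.
>     SEARCH_PATHS = ["/usr/share/osinfo", "/etc/osinfo",
>                     "~/.config/osinfo"]
>
>     oses = {}
>     for base in SEARCH_PATHS:
>         for path in sorted(Path(base).expanduser().rglob("*.xml")):
>             root = ET.parse(path).getroot()
>             for os_elem in root.findall("os"):
>                 # Last definition of a given OS ID wins.
>                 oses[os_elem.get("id")] = os_elem
>
>     print(f"loaded {len(oses)} OS entries")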
>
> - How do we provide notifications when updates are available ?
>
> eg, we don't want thousands of clients checking the libosinfo website
> daily to download a new database if it hasn't changed since they
> last checked. Can we efficiently provide info about database updates
> so people can check and avoid downloading if it hasn't changed ? I
> have thought about perhaps adding a DNS TXT record that records
> the SHA256 checksum of the database, so clients can do a simple
> DNS lookup to check for update availability. This is nice and scalable
> thanks to DNS server caching & TTLs, avoiding hitting the webserver
> most of the time.
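>
> A client-side check along those lines might look like this sketch -
> the record name, the TXT payload format, the cached file path and the
> use of dnspython are all assumptions on my part:
>
>     import hashlib
>
>     import dns.resolver  # dnspython
>
>     # Hypothetical record: _db.libosinfo.org TXT "sha256=<hex digest>"
>     answer = dns.resolver.resolve("_db.libosinfo.org", "TXT")
>     remote = answer[0].strings[0].decode().split("=", 1)[1]
>
>     # Compare against whatever tarball we downloaded last time.
>     with open("/var/cache/osinfo-db.tar.gz", "rb") as f:
>         local = hashlib.sha256(f.read()).hexdigest()
>
>     if remote != local:
>         print("database has changed upstream, worth re-downloading")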
>
> Regards,
> Daniel
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|