[Libosinfo] RFC: Splitting off database into separate package

Fri Jul 24 15:47:57 UTC 2015

On Fri, Jul 24, 2015 at 01:26:06PM +0100, Zeeshan Ali (Khattak) wrote:
> Hi Daniel,
> 
> On Wed, Jul 22, 2015 at 11:46 AM, Daniel P. Berrange
> > I think we also need to consider what we need to do to future proof
> > ourselves, because once we start distributing the database separately
> > from the code, we are making a much stronger commitment to supporting
> > the current database format long term. From that POV, I think we need
> > to consider a few things
> >
> >  - Is XML the format we want to use long term ?
> >
> >    We already ditched XML for the PCI & USB ID databases, in favour of
> >    directly loading the native data sources because XML was just too
> >    damn slow to parse. I'm concerned that as we increase the size of
> >    the database we might find this becoming a more general problem.
> >
> >    So should we do experiments to see if something like JSON or YAML
> >    is faster to load data from ?
> 
> I was thinking maybe we could have our own binary format that we
> transform the database to on loading and then have a caching mechanism
> in place so it's only done once per installation per version. There
> will be a 1-1 correspondence between xml and generated files so that
> cache is always used if corresponding xml file has not changed in a
> new version.

I guess you're thinking something like the way gobject introspection
works, where there's the .gir master XML file compiled into the
.typelib binary file for efficient access.

That's certainly an option to consider, though if we can figure out
a better source database format / structure that improves performance
it would be nice to avoid the extra complexity of having two formats.
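
To make the cache idea concrete, here is a minimal sketch of a
per-file cache that is reused as long as the corresponding XML file
has not changed. It is only an illustration under assumptions: pickle
stands in for whatever binary format we would actually design, the
cache key is a SHA-256 of the XML content, and the function names and
the reduced dict structure are hypothetical, not libosinfo API.

```python
import hashlib
import pickle
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_xml(xml_path):
    """Reduce an XML file to {entity id: {child tag: text}} (illustrative only)."""
    root = ET.parse(xml_path).getroot()
    entities = {}
    for elem in root:
        entity_id = elem.get("id")
        if entity_id is None:
            continue
        entities[entity_id] = {
            child.tag: (child.text or "").strip() for child in elem
        }
    return entities


def load_cached(xml_path, cache_dir):
    """Return (entities, from_cache), with one cache file per XML file.

    The cache file name embeds a hash of the XML content, so an
    unchanged XML file keeps hitting the same cache entry across
    versions, and a changed file misses and is parsed again.
    """
    xml_path = Path(xml_path)
    cache_dir = Path(cache_dir)
    digest = hashlib.sha256(xml_path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{xml_path.stem}-{digest}.cache"
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh), True
    entities = parse_xml(xml_path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(entities, fh)
    return entities, False
```

Hashing still reads the whole XML file, so it only saves the parsing
cost; keying on size and mtime instead would be cheaper but less
robust.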

> >  - Should we restructure the database ?
> >
> >    eg, we have a single data/oses/fedora.xml file that contains
> >    the data for every Fedora release. This is already 200kb in
> >    size and will grow forever. If we split up all the files
> >    so there is only ever one entity (os, hypervisor, device, etc)
> >    in each XML file, each file will be smaller in size. This would
> >    also let us potentially do database minimization. eg we could
> >    provide a download that contains /all/ OS, and another download
> >    that contains only non-end-of-life OS.
> 
> Or we could simply put end-of-life OS into separate xml files? Having
> a separate xml file for each os entry would imply loads of files and
> I/O performance at load time might become an issue.

One reason for separate files per entity is to make it easier to load
a subset of the database.

Currently, we have multiple files in our DB and entities can reference
other entities. The loader has no idea which file each entity is in,
so to resolve all the entity references, it has no option but to load
the entire database every time, even if we only want information about
a single operating system. This is a large part of where our performance
hit comes from and this will only get worse as our DB gets bigger :-(

If we re-defined the database so that there was a specifically required
file naming convention / dir layout, it would be possible to load the
minimal set of entities required to answer the question you have.

eg if you want to ask what block devices are supported by Fedora 20,
we would only need to load the handful of files that are referenced
by the Fedora 20 os definition. This would avoid performance problems
no matter how large our database gets, without needing a binary
format.

Of course the current libosinfo API doesn't make such an approach to
loading possible today, but we could extend the loader to allow for
this. Future non-libosinfo based loaders might also want it for
similar reasons.
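
As a rough sketch of what such a loader could look like, assuming a
layout of <root>/<kind>/<id>.xml with exactly one entity per file:
the class name, the directory layout, the short ids (the real ids are
URIs and would need some mapping to file names) and the
<devices><device id="..."/></devices> reference shape are all
assumptions made for the example, not the current schema or API.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


class LazyDB:
    """Load entities on demand from <root>/<kind>/<id>.xml (hypothetical layout)."""

    def __init__(self, root):
        self.root = Path(root)
        self.loaded = {}  # (kind, id) -> parsed XML element

    def get(self, kind, entity_id):
        """Parse the one file that holds this entity, at most once."""
        key = (kind, entity_id)
        if key not in self.loaded:
            path = self.root / kind / f"{entity_id}.xml"
            self.loaded[key] = ET.parse(path).getroot()
        return self.loaded[key]

    def devices_for_os(self, os_id):
        """Resolve only the device files referenced by one OS definition."""
        os_elem = self.get("os", os_id)
        return [
            self.get("device", ref.get("id"))
            for ref in os_elem.findall("devices/device")
        ]
```

Because the file name is derived from the entity id, the loader never
has to scan the rest of the database to resolve a reference, so the
cost of a query depends on what the OS references rather than on the
total database size.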

> >  - Should we formalize the specification so that we can officially
> >    support other library implementations
> >
> >    While libosinfo is accessible from many languages via GObject
> >    introspection, some projects are still loath to consume python
> >    libraries backed by native code. eg openstack would really
> >    prefer to be able to just pip install a pure python impl.
> 
> Sure, but I'd also keep this as a very low priority, nice-to-have item.

FWIW, it is a pretty strong desire / priority from the openstack
side currently, so not sure we can ignore it as a low priority item
for very long, or they may well fork libosinfo and do it themselves :-(

> >  - How do we provide notifications when updates are available
> >
> >    eg, we don't want 1000's of clients checking the libosinfo website
> >    daily to download a new database, if it hasn't changed since they
> >    last checked. Can we efficiently provide info about database updates
> >    so people can check and avoid downloading if it hasn't changed? I
> >    have thought about perhaps adding a DNS TXT record that records
> >    the SHA256 checksum of the database, so clients can do a simple
> >    DNS lookup to check for update availability. This is nice and scalable
> >    thanks to DNS server caching & TTLs, avoiding hitting the webserver
> >    most of the time.
> 
> If it would work, sounds great!
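
For illustration, the client side of that check could be as small as
the sketch below. The DNS TXT lookup itself is not in the Python
standard library, so it is passed in as a callable; that callable,
the function names and the idea that the record holds a bare hex
digest are assumptions for the example, not a defined protocol.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Hash the local database file without reading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def update_available(local_db_path, fetch_published_checksum):
    """Compare the published checksum with the local file's checksum.

    fetch_published_checksum is a caller-supplied function (hypothetical)
    that returns the hex SHA256 string taken from the DNS TXT record.
    """
    published = fetch_published_checksum().strip().lower()
    return published != sha256_of_file(local_db_path)
```

Only when update_available() returns True would the client go on to
contact the webserver at all.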

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|