[Libosinfo] RFC: Splitting off database into separate package
Daniel P. Berrange
berrange at redhat.com
Fri Jul 24 10:03:10 UTC 2015
Ping, anyone have any thoughts on these general ideas? I'm not
suggesting we do everything at once necessarily, but I'd like
us to at least figure out a direction to move forwards in...
On Wed, Jul 22, 2015 at 11:46:23AM +0100, Daniel P. Berrange wrote:
> Currently we distribute the database alongside the library code. We have
> long known this was not ideal over the long term, since the database is
> updated much more frequently than the code needs to be, and also people
> wish to be able to consume database updates without updating the code.
>
> As such, I think it is time to look at splitting the database off into
> a separate package and git repository from the code, which we can then
> distribute and release on separate timelines. In fact for the database
> we'd probably just want to make available daily automated tar.gzs from
> git, rather than doing manual releases.
>
> In the new GIT repository I think we'd need to have the following pieces
> of the current codebase
>
> - data/
> - tests/isodata/
> - tests/test-isodetect.c
>
> When doing an automated release, we'd want to run some tests to make
> sure the database is syntactically valid & can be successfully loaded,
> as well as run the existing ISO unit tests.
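>
> For instance, a rough Python sketch of the kind of check I have in
> mind (purely illustrative; the real test would presumably also load
> the data via libosinfo itself and run the ISO detection tests):
>
>     import sys
>     import xml.etree.ElementTree as ET
>     from pathlib import Path
>
>     # Refuse to cut a release tarball if any file under data/ is not
>     # even well-formed XML.
>     failures = 0
>     for path in Path("data").rglob("*.xml"):
>         try:
>             ET.parse(path)
>         except ET.ParseError as err:
>             print(f"{path}: {err}")
>             failures += 1
>     sys.exit(1 if failures else 0)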
>
> I think we also need to consider what we need to do to future-proof
> ourselves, because once we start distributing the database separately
> from the code, we are making a much stronger commitment to supporting
> the current database format long term. From that POV, I think we need
> to consider a few things
>
> - Is XML the format we want to use long term ?
>
> We already ditched XML for the PCI & USB ID databases, in favour of
> directly loading the native data sources because XML was just too
> damn slow to parse. I'm concerned that as we increase the size of
> the database we might find this becoming a more general problem.
>
> So should we do experiments to see if something like JSON or YAML
> is faster to load data from ?
>
> If we want to use a different format, should we do it exclusively
> or in parallel ?
>
> eg should we drop XML support if we switch to JSON, or should
> we keep XML support and automatically generate a JSON version
> of the database.
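>
> If we want numbers for that, a quick sketch of the experiment could
> look like this (assuming we mechanically generate a JSON copy of the
> same data; the file names here are invented):
>
>     import json
>     import timeit
>     import xml.etree.ElementTree as ET
>     from pathlib import Path
>
>     # Compare parse time of the same record set in both formats.
>     xml_time = timeit.timeit(
>         lambda: ET.parse("data/oses/fedora.xml"), number=100)
>     json_time = timeit.timeit(
>         lambda: json.loads(Path("data/oses/fedora.json").read_text()),
>         number=100)
>     print(f"XML: {xml_time:.3f}s  JSON: {json_time:.3f}s")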
>
> - Do we need to mark a database schema version in some manner ?
>
> eg any time we add a new attribute or element to the schema we
> should increment the version number. That would allow the library
> to detect what version of the data it is receiving. Even though old
> libraries should be fine accepting new database versions, and
> new libraries should be fine accepting old database versions,
> experience has told me we should have some kind of versioning
> information as a "get out of jail free card".
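>
> As a sketch of what I mean (the attribute name is invented), the root
> element could carry a schema version that the library checks on load:
>
>     import xml.etree.ElementTree as ET
>
>     # Hypothetical: <libosinfo schema-version="1.0"> ... </libosinfo>
>     SUPPORTED_MAJOR = 1
>
>     root = ET.parse("data/oses/fedora.xml").getroot()
>     major = int(root.get("schema-version", "1.0").split(".")[0])
>     if major > SUPPORTED_MAJOR:
>         # Additions should be backwards compatible, so just warn; the
>         # version is the "get out of jail free card" if that breaks.
>         print("warning: database schema is newer than this library")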
>
> - Should we restructure the database ?
>
> eg, we have a single data/oses/fedora.xml file that contains
> the data for every Fedora release. This is already 200KB in
> size and will grow forever. If we split up all the files
> so there is only ever one entity (os, hypervisor, device, etc)
> in each XML file, each file will be smaller. This would
> also let us potentially do database minimization, eg we could
> provide a download that contains /all/ OSes, and another download
> that contains only non-end-of-life OSes.
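>
> To illustrate the kind of mechanical split I mean, a rough sketch
> (assuming the existing <os>/<short-id> layout; the output directory
> and file naming are purely invented):
>
>     import xml.etree.ElementTree as ET
>     from pathlib import Path
>
>     # Break the monolithic fedora.xml into one file per <os> entity,
>     # named after its <short-id>.
>     tree = ET.parse("data/oses/fedora.xml")
>     root = tree.getroot()
>     outdir = Path("data/os/fedoraproject.org")
>     outdir.mkdir(parents=True, exist_ok=True)
>
>     for os_elem in root.findall("os"):
>         short_id = os_elem.findtext("short-id") or "unknown"
>         single = ET.Element(root.tag, root.attrib)
>         single.append(os_elem)
>         ET.ElementTree(single).write(outdir / f"{short_id}.xml",
>                                      encoding="utf-8",
>                                      xml_declaration=True)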
>
> - Should we formalize the specification so that we can officially
> support other library implementations ?
>
> While libosinfo is accessible from many languages via GObject
> introspection, some projects are still loath to consume Python
> libraries backed by native code, eg OpenStack would really
> prefer to be able to just pip install a pure Python impl.
>
> Currently the libosinfo library includes some implicit business
> logic about how you load the database and deal with overrides
> from different files, eg if you have the same OS ID defined in
> multiple XML files, which one "wins", and which paths are supposed
> to be considered when loading files. In the future this may also
> cover how to download live updates over the net. It also has logic
> about how you detect ISO images & install trees from the media data
> and how to generate kickstart files, etc, none of which is formally
> specified or documented.
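>
> To give a flavour of the rules a spec would need to pin down, here is
> a rough sketch of one possible policy - the search paths and the
> "last file wins" ordering are illustrative, not necessarily what
> libosinfo does today:
>
>     import xml.etree.ElementTree as ET
>     from pathlib import Path
>
>     # Illustrative search order: later paths override earlier ones.
>     SEARCH_PATHS = ["/usr/share/osinfo", "/etc/osinfo",
>                     "~/.config/osinfo"]
>
>     oses = {}
>     for base in SEARCH_PATHS:
>         for path in sorted(Path(base).expanduser().rglob("*.xml")):
>             root = ET.parse(path).getroot()
>             for os_elem in root.findall("os"):
>                 # Last definition of a given OS ID wins.
>                 oses[os_elem.get("id")] = os_elem
>
>     print(f"loaded {len(oses)} OS entries")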
>
> - How do we provide notifications when updates are available ?
>
> eg, we don't want thousands of clients checking the libosinfo website
> daily to download a new database if it hasn't changed since they
> last checked. Can we efficiently provide info about database updates
> so people can check and avoid downloading if it hasn't changed ? I
> have thought about perhaps adding a DNS TXT record that records
> the SHA256 checksum of the database, so clients can do a simple
> DNS lookup to check for update availability. This is nice and scalable
> thanks to DNS server caching & TTLs, avoiding hitting the webserver
> most of the time.
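>
> A client-side check along those lines might look like this sketch -
> the record name, the TXT payload format, the cached file path and the
> use of dnspython are all assumptions on my part:
>
>     import hashlib
>
>     import dns.resolver  # dnspython
>
>     # Hypothetical record: _db.libosinfo.org TXT "sha256=<hex digest>"
>     answer = dns.resolver.resolve("_db.libosinfo.org", "TXT")
>     remote = answer[0].strings[0].decode().split("=", 1)[1]
>
>     # Compare against whatever tarball we downloaded last time.
>     with open("/var/cache/osinfo-db.tar.gz", "rb") as f:
>         local = hashlib.sha256(f.read()).hexdigest()
>
>     if remote != local:
>         print("database has changed upstream, worth re-downloading")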
>
> Regards,
> Daniel
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|