[Libosinfo] RFC: Splitting off database into separate package
Christophe Fergeau
cfergeau at redhat.com
Fri Jul 24 14:50:34 UTC 2015
Hi,
On Wed, Jul 22, 2015 at 11:46:23AM +0100, Daniel P. Berrange wrote:
> Currently we distribute the database alongside the library code. We have
> long known this was not ideal over the long term, since the database is
> updated much more frequently than the code needs to be, and also people
> wish to be able to consume database updates without updating the code.
>
> As such, I think it is time to look at splitting the database off into
> a separate package and git repository from the code, which we can then
> distribute and release on separate timelines. In fact for the database
> we'd probably just want to make available daily automated tar.gzs from
> git, rather than doing manual releases.
>
> In the new GIT repository I think we'd need to have the following pieces
> of the current codebase
>
> - data/
> - tests/isodata/
> - tests/test-isodetect.c
>
> When doing an automated release, we'd want to run some tests to make
> sure the database is syntactically valid & can be successfully loaded,
> as well as the existing iso unit tests.
>
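A minimal sketch of such a pre-release check, assuming the database stays
as XML under data/ (the script itself is hypothetical, and it only checks
well-formedness; validating against the RNG schema would need an extra
step, eg xmllint --relaxng):

#!/usr/bin/env python3
# Hypothetical pre-release check: make sure every database file is
# well-formed XML before publishing a tarball.
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

checked = 0
failures = []
for xml_file in sorted(Path("data").rglob("*.xml")):
    checked += 1
    try:
        ET.parse(xml_file)             # raises ParseError on malformed XML
    except ET.ParseError as err:
        failures.append(f"{xml_file}: {err}")

if failures:
    print("\n".join(failures), file=sys.stderr)
    sys.exit(1)
print(f"{checked} database files are well-formed XML")
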
> I think we also need to consider what we need to do to future-proof
> ourselves, because once we start distributing the database separately
> from the code, we are making a much stronger commitment to supporting
> the current database format long term. From that POV, I think we need
> to consider a few things
>
> - Is XML the format we want to use long term ?
>
> We already ditched XML for the PCI & USB ID databases, in favour of
> directly loading the native data sources because XML was just too
> damn slow to parse. I'm concerned that as we increase the size of
> the database we might find this becoming a more general problem.
>
> So should we do experiments to see if something like JSON or YAML
> is faster to load data from ?
>
> If we want to use a different format, should we do it exclusively
> or in parallel ?
>
> eg should we drop XML support if we switch to JSON, or should
> we keep XML support and automatically generate a JSON version
> of the database.
Currently we rely on intltool to handle translation of the database XML
files. gettext seems to be able to handle JavaScript, but I don't know
whether this can be used for JSON files as well, so maybe we'll have to
keep the XML files as a way to manage translations.
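If someone wants to get numbers on this, a rough micro-benchmark along
these lines could give a first data point (a sketch only: it assumes a
converted JSON copy of the database exists under a made-up data-json/
directory, and Python parse times are only a proxy for the C parsers
libosinfo would actually use):

#!/usr/bin/env python3
# Rough comparison of XML vs JSON parse time for the same database content.
# Assumes data/ holds the current XML files and data-json/ a converted
# copy; both paths are hypothetical.
import json
import time
import xml.etree.ElementTree as ET
from pathlib import Path

def time_parse(files, parse):
    start = time.perf_counter()
    for path in files:
        parse(path)
    return time.perf_counter() - start

xml_files = sorted(Path("data").rglob("*.xml"))
json_files = sorted(Path("data-json").rglob("*.json"))

xml_secs = time_parse(xml_files, lambda p: ET.parse(p))
json_secs = time_parse(json_files, lambda p: json.loads(p.read_text()))

print(f"XML : {len(xml_files)} files in {xml_secs:.3f}s")
print(f"JSON: {len(json_files)} files in {json_secs:.3f}s")
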
> - Do we need to mark a database schema version in some manner ?
>
> eg any time we add new attributes or elements to the schema we
> should increment the version number. That would allow the library
> to detect what version data it is receiving. Even though old
> libraries should be fine accepting new database versions, and
> new libraries should be fine accepting old database versions,
> experience has told me we should have some kind of versioning
> information as a "get out of jail free card"
>
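For illustration, the version could be a single attribute on the document
root, checked at load time; the "schema-version" attribute name and the
cut-off policy below are made up, not an agreed format:

#!/usr/bin/env python3
# Hypothetical schema-version check: reject files newer than what this
# library understands, accept anything older. The attribute name
# "schema-version" is an assumption, not part of the current format.
import xml.etree.ElementTree as ET

SUPPORTED_SCHEMA_VERSION = 2

def check_schema_version(path):
    root = ET.parse(path).getroot()
    version = int(root.get("schema-version", "1"))  # absent = pre-versioning
    if version > SUPPORTED_SCHEMA_VERSION:
        raise RuntimeError(
            f"{path}: schema version {version} is newer than "
            f"supported version {SUPPORTED_SCHEMA_VERSION}")
    return version
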
> - Should we restructure the database ?
>
> eg, we have a single data/oses/fedora.xml file that contains
> the data for every Fedora release. This is already 200kb in
> size and will grow forever. If we split up all the files
> so there is only ever one entity (os, hypervisor, device, etc)
> in each XML file, each file will be smaller in size. This would
> also let us potentially do database minimization. eg we could
> provide a download that contains /all/ OSes, and another download
> that contains only non-end-of-life OSes.
I was about to make the same comment as Zeeshan: GNOME has had issues in
the past with data scattered among too many small files. In general this
is solved by adding a cache file containing a concatenated version of
all the files (possibly pre-parsed into some domain-specific format).
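As a sketch of the simplest variant of that idea (one entity per file on
disk, plus a generated cache that just wraps all of them under a single
root element; the <libosinfo-cache> element name and the output path are
invented for the example):

#!/usr/bin/env python3
# Build a single cache file by concatenating every per-entity XML file
# under one root element. Cache root name and output path are hypothetical.
import xml.etree.ElementTree as ET
from pathlib import Path

cache_root = ET.Element("libosinfo-cache")
for xml_file in sorted(Path("data").rglob("*.xml")):
    cache_root.append(ET.parse(xml_file).getroot())

ET.ElementTree(cache_root).write("osinfo-cache.xml",
                                 encoding="utf-8",
                                 xml_declaration=True)
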
> - Should we formalize the specification so that we can officially
> support other library implementations
>
> While libosinfo is accessible from many languages via GObject
> introspection, some projects are still loath to consume Python
> libraries backed by native code. eg OpenStack would really
> prefer to be able to just pip install a pure Python impl.
>
> Currently the libosinfo library includes some implicit business
> logic about how you load the database, and dealing with overrides
> from different files. eg if you have the same OS ID defined in
> multiple XML files, which one "wins". Also which paths are supposed
> to be considered when loading files. In the future also possibly
> how to download live updates over the net. It also has logic about
> how you detect ISO images & install trees from the media data and
> how to generate kickstart files, etc, none of which is formally
> specified or documented.
This could be nice, but I guess this could come later (possibly at the
same time as the database schema versioning, if some database format
changes are needed in order to accommodate these independent
implementations).
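Just to make it concrete, the kind of rule such a spec would have to pin
down fits in a few lines; the search path and the "last definition wins"
policy below are placeholders, not the documented libosinfo behaviour:

#!/usr/bin/env python3
# Sketch of an override rule a written spec could define: databases are
# loaded from a list of locations in increasing priority, and for a given
# entity ID the last definition loaded wins. Directory list is illustrative.
import xml.etree.ElementTree as ET
from pathlib import Path

# Lowest priority first; a later directory overrides an earlier one.
SEARCH_PATH = [
    Path("/usr/share/osinfo"),        # vendor-supplied database (assumed path)
    Path("/etc/osinfo"),              # system-admin overrides (assumed path)
    Path.home() / ".config/osinfo",   # per-user overrides (assumed path)
]

def load_entities():
    entities = {}                     # entity ID -> XML element
    for directory in SEARCH_PATH:
        if not directory.is_dir():
            continue
        for xml_file in sorted(directory.rglob("*.xml")):
            root = ET.parse(xml_file).getroot()
            for entity in root:
                entity_id = entity.get("id")
                if entity_id:
                    entities[entity_id] = entity   # later definition wins
    return entities
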
> - How do we provide notifications when updates are available
>
> eg, we don't want thousands of clients checking the libosinfo website
> daily to download a new database, if it hasn't changed since they
> last checked. Can we efficiently provide info about database updates
> so people can check and avoid downloading if it hasn't changed ? I
> have thought about perhaps adding a DNS TXT record that records
> the SHA256 checksum of the database, so clients can do a simple
> DNS lookup to check for update availability. This is nice and scalable
> thanks to DNS server caching & TTLs, avoiding hitting the webserver
> most of the time.
This also means more special magic to be implemented by libosinfo
consumers, which is not necessarily an issue. If libosinfo is to
download database updates more or less automatically, we'll need
to make this downloading as safe as possible (HTTPS, GPG signature
with a known key?).
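To make the DNS idea concrete, a client-side check could look roughly
like this (the TXT record name and the local tarball path are made up,
and the actual download plus GPG verification are left out):

#!/usr/bin/env python3
# Sketch of the DNS-based update check: compare the SHA256 checksum of the
# locally installed database against one published in a DNS TXT record.
# Record name and local path are hypothetical.
import hashlib
import subprocess

TXT_RECORD = "db-checksum.libosinfo.org"
LOCAL_DB = "/var/lib/osinfo/osinfo-db.tar.gz"

def published_checksum():
    # "dig +short TXT" prints the record data wrapped in double quotes.
    out = subprocess.run(["dig", "+short", "TXT", TXT_RECORD],
                         capture_output=True, text=True, check=True).stdout
    return out.strip().strip('"')

def local_checksum():
    with open(LOCAL_DB, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

if published_checksum() != local_checksum():
    print("update available; fetch it over HTTPS and verify its GPG signature")
else:
    print("database is up to date")
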
Christophe