[Libosinfo] RFC: Splitting off database into separate package
Zeeshan Ali (Khattak)
zeeshanak at gnome.org
Fri Jul 24 12:26:06 UTC 2015
Hi Daniel,
On Wed, Jul 22, 2015 at 11:46 AM, Daniel P. Berrange
<berrange at redhat.com> wrote:
> Currently we distribute the database alongside the library code. We have
> long known this was not ideal over the long term, since the database is
> updated much more frequently than the code needs to be, and also people
> wish to be able to consume database updates without updating the code.
>
> As such, I think it is time to look at splitting the database off into
> a separate package and git repository from the code, which we can then
> distribute and release on separate timelines. In fact for the database
> we'd probably just want to make available daily automated tar.gzs from
> git, rather than doing manual releases.
That would be nice indeed and I have always wished for such a system.
> When doing an automated release, we'd want to run some tests to make
> sure the database is syntactically valid & can be successfully loaded,
> as well as the existing iso unit tests.
Indeed.
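
A minimal sketch of the "syntactically valid" part of such a release check, assuming the database is a directory tree of .xml files. The function name find_invalid_xml is hypothetical, and this only checks well-formedness; it does not validate against the schema or run the ISO unit tests.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_invalid_xml(db_dir):
    """Return {path: error message} for every .xml file that fails to parse."""
    errors = {}
    for path in sorted(Path(db_dir).rglob("*.xml")):
        try:
            ET.parse(path)
        except ET.ParseError as exc:
            # Record the parser's message so the release job can report it.
            errors[str(path)] = str(exc)
    return errors
```

An automated release job could refuse to publish the tarball whenever the returned dict is non-empty.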
> I think we also need to consider what we need to do to future proof
> ourselves, because once we start distributing the database separately
> from the code, we are making a much stronger commitment to supporting
> the current database format long term. From that POV, I think we need
> to consider a few things
>
> - Is XML the format we want to use long term ?
>
> We already ditched XML for the PCI & USB ID databases, in favour of
> directly loading the native data sources because XML was just too
> damn slow to parse. I'm concerned that as we increase the size of
> the database we might find this becoming a more general problem.
>
> So should we do experiments to see if something like JSON or YAML
> is faster to load data from ?
I was thinking maybe we could have our own binary format that we
transform the database to on loading and then have a caching mechanism
in place so it's only done once per installation per version. There
will be a 1-1 correspondence between XML and generated files so that
the cache is always used if the corresponding XML file has not changed
in a new version.
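
A minimal sketch of such a per-file cache, under several assumptions: pickle stands in for the custom binary format, the cache is keyed by a SHA-256 of the XML content so an unchanged file always hits the cache, and parse_xml, load_cached and the <os id="..."> layout are hypothetical stand-ins rather than libosinfo code.

```python
import hashlib
import pickle
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_xml(xml_path):
    """Slow path: parse the XML into a plain dict (stand-in for the real loader)."""
    root = ET.parse(xml_path).getroot()
    return {os_el.get("id"): {child.tag: child.text for child in os_el}
            for os_el in root.iter("os")}


def load_cached(xml_path, cache_dir):
    """Load one XML file; return (data, cache_hit).

    One cache file exists per XML file and content hash, so a new database
    version only regenerates the cache for files that actually changed.
    """
    xml_path, cache_dir = Path(xml_path), Path(cache_dir)
    digest = hashlib.sha256(xml_path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{xml_path.stem}-{digest}.cache"
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh), True
    data = parse_xml(xml_path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(data, fh)
    return data, False
```

Hashing still reads the whole file, but that is cheap next to parsing it. Pickle is only reasonable here because the cache is generated locally from trusted input; a real binary format would also need its own version marker, and stale cache files would need cleaning up.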
> - Do we need to mark a database schema version in some manner ?
>
> eg any time we add new attribute or elements to the schema we
> should increment the version number. That would allow the library
> to detect what version data it is receiving. Even though old
> libraries should be fine accepting new database versions, and
> new libraries should be fine accepting old database versions,
> experience has told me we should have some kind of versioning
> information as a "get out of jail free card"
Yeah, although I'd do this as the last part of this mega project.
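
A minimal sketch of how a library could use such a version marker, assuming a "major.minor" string stored in a version attribute on the root element. The attribute name, the numbering policy and the check_schema helper are all assumptions for illustration, not an agreed design.

```python
import xml.etree.ElementTree as ET


def check_schema(xml_text, supported=(1, 2)):
    """Classify a database document against the schema version we understand.

    Returns "unversioned", "incompatible", "newer" or "ok".
    """
    root = ET.fromstring(xml_text)
    text = root.get("version")
    if text is None:
        return "unversioned"
    major, minor = (int(part) for part in text.split(".")[:2])
    if major != supported[0]:
        # Hypothetical policy: a major bump means the format changed incompatibly.
        return "incompatible"
    if minor > supported[1]:
        # Newer additions only: load it, ignoring unknown attributes and elements.
        return "newer"
    return "ok"
```

The "newer" case is the get-out-of-jail card: an old library can still load the data but knows it may be missing something.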
> - Should we restructure the database ?
>
> eg, we have a single data/oses/fedora.xml file that contains
> the data for every Fedora release. This is already 200kb in
> size and will grow forever. If we split up all the files
> so there is only ever one entity (os, hypervisor, device, etc)
> in each XML file, each file will be smaller in size. This would
> also let us potentially do database minimization. eg we could
> provide a download that contains /all/ OS, and another download
> that contains only non-end-of-life OS.
Or we could simply put end-of-life OS into separate xml files? Having
a separate xml file for each os entry would imply loads of files and
I/O performance at load time might become an issue.
> - Should we formalize the specification so that we can officially
> support other library implementations ?
>
> While libosinfo is accessible from many languages via GObject
> introspection, some projects are still loath to consume python
> libraries backed by native code. eg openstack would really
> prefer to be able to just pip install a pure python impl.
Sure, but I'd also keep this as a very low-priority, nice-to-have item.
> - How do we provide notifications when updates are available ?
>
> eg, we don't want 1000's of clients checking the libosinfo website
> daily to download a new database, if it hasn't changed since they
> last checked. Can we efficiently provide info about database updates
> so people can check and avoid downloading if it hasn't changed. I
> have thought about perhaps adding a DNS TXT record that records
> the SHA256 checksum of the database, so clients can do a simple
> DNS lookup to check for update availability. This is nice and scalable
> thanks to DNS server caching & TTLs, avoiding hitting the webserver
> most of the time.
If it would work, sounds great!
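
A minimal sketch of the client side of that check, assuming the TXT record holds the hex SHA256 of the current database tarball. The record name db.libosinfo.example is made up, and lookup_txt is a caller-supplied function that performs the actual DNS query, since the Python standard library has no TXT lookup.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Hex SHA-256 of a file, read in chunks so large tarballs are fine."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def update_available(local_archive, lookup_txt,
                     record_name="db.libosinfo.example"):
    """True if the checksum published in DNS differs from the local tarball.

    lookup_txt(name) must return the TXT record's text; it is injected so
    that any resolver library (or a fake, in tests) can be used.
    """
    published = lookup_txt(record_name).strip().strip('"').lower()
    return published != sha256_of_file(local_archive)
```

The client only hits the webserver when update_available() returns True; every other check is answered by DNS caches. A plain TXT record is not authenticated, so the download itself would still need to be verified some other way.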
--
Regards,
Zeeshan Ali (Khattak)
________________________________________
Befriend GNOME: http://www.gnome.org/friends/