This post is primarily about taking some of the lessons we learned in Fedora QA infrastructure and applying them to some internal Red Hat pipelines as well as generalizing the pattern to other parts of Fedora's infrastructure.
- ResultsDB is a database for storing results. Unsurprising!
- It is a passive system; it doesn't actively do anything.
- It has an HTTP REST interface. You POST new results to it and GET them out.
- It was written by Josef Skladanka of Tim Flink's Fedora QA team.
It was originally written as a part of the larger Taskotron system, but we're talking about it independently here because it's such a gem!
What problems can we solve with it?
In formal Factory 2.0 problem statement terms, this helps us solve the Serialization and Automation problems directly, and indirectly all of the problems that depend on those two.
Beyond that, let's look at fragmentation. The goal of the "Central CI" project in Red Hat was to consolidate all of the fragmentation around various CI solutions. This was a success in terms of operational and capex costs -- instead of 100 different CI systems running on 100 different servers sitting under 100 different desks, we have one Central CI infrastructure backed by OpenStack serving on-demand Jenkins masters. Win. A side-effect of this has been that teams can request and configure their own Jenkins masters, without commonality in their configuration. While teams are incentivized to move to a common test execution tool (Jenkins), there's no common way to organize jobs and results. While we reduced fragmentation at one level, it remains untouched at another. People informally speak of this as the problem of "the fourteen Jenkins masters" of Platform QE.
Beyond Jenkins, some Red Hat PnT DevOps tools perform tasks that are QE-esque yet are not part of the Central CI infrastructure. Notably, the Errata Tool (which plays a role very similar to Fedora's Bodhi system) directly runs jobs like covscan, rpmgrill, rpmdiff, and TPS/insanity that are unnecessarily tied to the "release checklist" phase of the workflow. They could benefit from the common infrastructure of Central CI. (The Errata Tool developers are burdened with thinking about scheduling and storing test results while developing the release checklist application. This thing is too big.)
One option could be to attempt to corral all of the various dev and QE groups into getting onto the same platform and configuring their jobs the same way. That's a possibility, but there is a high cost to achieving that level of social coordination.
Instead, we intend to use resultsdb and a small number of messagebus hooks to insulate consuming services from the details of job execution.
Getting data out of resultsdb
Resultsdb, unsurprisingly, stores results. A result must be associated with a testcase, which is just a namespaced name (for example, general.rpmlint). It must also be associated with an item, which you can think about as the unique name of a build artifact produced by some RCM tool: the nevra of an rpm is a typical value for the item field indicating that a particular result is associated with a particular rpm.
Take a look at some examples of queries to the Fedora QA production instance of taskotron, to get an idea for what this thing can store:
- A list of known testcases https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/testcases
- Information on the depcheck testcase https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/testcases/dist.depcheck
- All known results for the depcheck testcase https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/testcases/dist.depcheck/results
- Only depcheck results associated with builds https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/testcases/dist.depcheck/results?type=koji_build
- All rpmlint results associated with the python-gradunwarp-1.0.3-1.fc24 build https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/testcases/dist.rpmlint/results?item=python-gradunwarp-1.0.3-1.fc24
- All results of any testcase associated with that same build https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/results?item=python-gradunwarp-1.0.3-1.fc24
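Reading from the API can be scripted in a few lines. Here is a minimal sketch using only the standard library; the `results_url` and `get_results` helpers are my own illustration, the endpoint path mirrors the example URLs above, and the `data` envelope key is assumed from resultsdb's JSON responses:

```python
import json
import urllib.parse
import urllib.request

API = "https://taskotron.fedoraproject.org/resultsdb_api/api/v1.0"

def results_url(testcase, **params):
    """Build the query URL for results of a given testcase."""
    url = "%s/testcases/%s/results" % (API, testcase)
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url

def get_results(testcase, **params):
    """Fetch and decode the JSON list of results (requires network access)."""
    with urllib.request.urlopen(results_url(testcase, **params)) as response:
        return json.load(response)["data"]

# For example, all rpmlint results for a particular koji build:
url = results_url("dist.rpmlint", item="python-gradunwarp-1.0.3-1.fc24")
```

Building the URL is kept separate from fetching so a query can be inspected or logged before any network access happens.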
For the Release Checklist
For the Errata Tool problems described in the introduction, we need to:
- Set up Jenkins jobs that do exactly what the Errata Tool processes do today: rpmgrill, covscan, rpmdiff, TPS/Insanity. Ondrej Hudlicky's group is working on this.
- We need to ingest data from the bus about those jobs and store it in resultsdb. The Factory 2.0 team will be working on that.
- We also need to write and stand up an accompanying waiverdb service that allows overriding an otherwise immutable result in resultsdb. We can re-use this in Fedora to level up the Bodhi/taskotron interaction.
- The Errata Tool needs to be modified to refer to resultsdb's stored results instead of its own.
- We can decommission Errata Tool's scheduling and storage of QE-esque activities. Hooray!
Note that, in Fedora, the Bodhi Updates System already works along these lines to gate updates on their resultsdb status. A subset of testcases is declared as required. However, if a testcase fails erroneously, a developer must change the requirements associated with the update to get it out the door. This is silly. Writing and deploying something like waiverdb will make that much more straightforward.
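To make the idea concrete, here is a hypothetical sketch of the kind of logic a waiverdb-style service would enable. None of this exists yet; the data shapes and the `effective_outcome` name are illustrative assumptions only:

```python
def effective_outcome(result, waivers):
    """Layer waivers over an immutable resultsdb result.

    A failed result counts as passed if a matching waiver has been
    recorded; the result itself is never modified.
    """
    if result["outcome"] == "PASSED":
        return "PASSED"
    waived = any(w["testcase"] == result["testcase"]
                 and w["item"] == result["item"]
                 for w in waivers)
    return "PASSED" if waived else result["outcome"]
```

The point of the separation is that test results stay an immutable record of what happened, while human judgment about whether a failure *matters* lives in its own auditable store.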
On expanding this pattern in Fedora
Note also that the fedimg tool, used to upload newly composed images to AWS, currently has no gating in place at all. It uploads everything. While talking about how we actually want to introduce gating into its workflow, it was proposed that it should query the cloud-specific test executor called autocloud. Our answer here should be no. Autocloud should store its results in resultsdb, and fedimg should consult resultsdb to know if an image is good or not. This insulates fedimg's code from the details of autocloud and enables us to more flexibly change out QE methods and tools in the future.
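A sketch of what that consultation could look like from fedimg's side, assuming autocloud stores its results under testcase names prefixed with `autocloud.` (an assumption for illustration; the real naming may differ):

```python
def image_is_good(results):
    """Gate an image upload on its autocloud outcomes in resultsdb.

    `results` is a list of result dicts as returned by a resultsdb query
    for the image's item. An image with no autocloud results at all is
    conservatively treated as not good.
    """
    outcomes = [r["outcome"] for r in results
                if r["testcase"].startswith("autocloud.")]
    return bool(outcomes) and all(o == "PASSED" for o in outcomes)
```

Nothing in this function knows how autocloud runs its tests, which is exactly the insulation the paragraph above argues for.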
For Rebuild Automation
For Fedora Modularity, we know we need to build and deploy tools to automate rebuilds. In order to avoid unnecessary rebuilds of Tier 2 and Tier 3 artifacts, we'll want to first ensure that Tier 1 artifacts are "good". The rebuild tooling we design will need to:
- Refer to resultsdb to gather testcase results. It should not query test-execution systems directly for the reasons mentioned above.
- Have configurable policy. Resultsdb gives us access to all test results. Do we block rebuilds if one test fails? How do we introduce new experimental tests while not blocking the rebuild process? A constrained subset of the total set of testcases should be used on a per-product/per-component basis to define the rebuild criteria: a policy.
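Such a policy could start as simply as a mapping from component to its required testcases; everything in this sketch (names, structure, the `gate` function) is hypothetical:

```python
REQUIRED = {
    # component -> the constrained subset of testcases that gate its rebuild
    "python-gradunwarp": ["dist.rpmlint", "dist.depcheck"],
}

def gate(component, results):
    """Decide whether a rebuild of `component` may proceed.

    `results` maps testcase name -> outcome string as stored in resultsdb.
    Testcases outside the required subset (e.g. new experimental ones)
    are ignored, so they never block the rebuild. A component with no
    declared policy is not blocked at all.
    """
    required = REQUIRED.get(component, [])
    return all(results.get(tc) == "PASSED" for tc in required)
```

An experimental testcase can then be introduced and observed for a while before anyone adds it to a component's required subset.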
Putting data in resultsdb
- Resultsdb receives new results by way of an HTTP POST.
- In Fedora, the Taskotron system puts results directly into resultsdb.
- Internally, we'll need a level of indirection due to the social coordination issue described above. Any QE process that wants to have its results stored in resultsdb (and therefore be considered in PnT DevOps rebuild and release processes) will need to publish to the unified message bus or the CI-bus using the “CI-Metrics” format driven by Jiri Canderle.
- The Factory 2.0 team will write, deploy and maintain a service that listens for those messages, formats them appropriately, and stores them in resultsdb.
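The final step of that listener, the HTTP POST itself, might look like this standard-library sketch. The payload field names here are assumptions extrapolated from the query examples earlier; check the resultsdb API documentation for the exact schema of your deployment:

```python
import json
import urllib.request

def report_result(api, testcase, item, outcome, item_type="koji_build"):
    """Prepare an HTTP POST of a new result to a resultsdb endpoint.

    Returns the prepared request so it can be inspected before sending;
    urllib.request.urlopen(request) actually performs the POST.
    """
    payload = {
        "testcase_name": testcase,   # e.g. dist.rpmlint
        "outcome": outcome,          # e.g. PASSED, FAILED
        "item": item,                # e.g. an rpm nevra
        "type": item_type,
    }
    return urllib.request.Request(
        api + "/results",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```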
Today marks the end of Sprint #2 for the so-called Factory 2.0 project. So far, it's been a small-ish team with Matthew Prahl and myself (and part-time support from the awesome Michael Bonnet). We know which way we want to go now and we're going to pick up a few more people in the coming weeks.
We write status reports internally, but we're doing work in the open, in Fedora. We should probably bridge that gap and I figured the easiest way to share would be to post our demo videos here as well:
- Running fedmsg services on other message bus backends
- Storing rpm dependencies in PDC
- Storing container dependencies in PDC
- Querying dependencies from PDC
- Standing up resultsdb with ansible
I'll try my best to post all of our future sprint report videos here. No secrets! The best place to find us is in #fedora-admin. Happy Hacking!
Recently, over in a pull request review, we found the need to list all of the existing fedmsg topics (to see if some code we were writing would or wouldn't stumble on any of them).
Way back when, we added a feature to datagrepper that would let you list all of the unique topics, but it never worked and was eventually removed.
Here's the next best thing:
    #!/usr/bin/env python
    from fedmsg_meta_fedora_infrastructure import (
        tests,
        doc_utilities,
    )
    from fedmsg.tests.test_meta import Unspecified

    classes = doc_utilities.load_classes(tests)
    classes = [cls for cls in classes if hasattr(cls.context, 'msg')]

    topics = []
    for cls in classes:
        if cls.context.msg is not Unspecified:
            topics.append(cls.context.msg['topic']
                          .replace('.stg.', '.prod.')
                          .replace('.dev.', '.prod.'))

    # Unique and sort
    topics = sorted(set(topics))

    import pprint
    pprint.pprint(topics)
Like in previous years, a few of us from Fedora were at PyCon US in Portland, Oregon for the week. The conference is over now (I'm sticking around for a day to explore the Pacific Northwest). Here are some of the highlights from the talks we attended and the community sprint days:
Talks worth checking out
- K Lars Lohn's final keynote was out of control. None of us were ready for it. It wasn't even about python but I know everyone loved it. Parisa Tabriz's keynote on hacker mindset was very good (she's the "security princess" at Google) and Guido van Rossum's keynote on the state of python wandered off into an interesting autobiography about what made Python possible. If you're interested in software architecture, the Wednesday morning keynote on Plone and Zope by @cewing was an interesting overview of the evolution of that stack.
- Alex Gaynor's talk on automation for dev groups, cleverly titled "The cobbler's children have no shoes", was close to my heart. There was a salient point in the Q&A section about how, while we often focus on automating workflows that are somehow problematic, sometimes that problem is a deeper social one. Automation can surface and inadvertently exacerbate a tension between groups that have friction.
- For web development stuff, three talks are worth highlighting: @callahad of Mozilla (who is an awesome person) gave a talk on new mobile web technologies, Service Workers, Push, and App Manifests. It's worth a listen for people in the Fedora and Red Hat infrastructure ecosystem. @dshafik of Akamai gave a super interesting talk on HTTP/2 and the consequences for web devs. The short of it is that we have all these hacks in place that have become "best practice" over the years (sprite sheets, compressed and concatenated assets, bloated collection REST responses), none of which are necessary or desirable when we have HTTP/2 ready to go server-side. Sixty percent of browsers are ready to consume HTTP/2 apps and it's all backwards compatible. Definitely worth looking into. Last but not least, if you do web development, check out Sumana's talk titled "HTTP can do that?!" which goes over how to get the most out of HTTP/1 (something we've not always been the best at doing) -- very engaging.
- If you watch any of the talks here, check out Larry Hastings's talk on removing python's global interpreter lock. It's important if you use the language, deal with performance issues, and especially if you write C extensions. If none of those are you -- the details of the interpreter implementation are still super interesting. #gilectomy
- Of course the hallway track was the most valuable. I had good talks with @goodwillbits, @lvh, @sils1297, and too many others to mention.
For the community code sprints, I hacked with a couple other people on the test suite for koji, which is the build system used by Fedora and many other RPM-based Linux distributions. We have a lot of web services and systems that go into producing the distro. Koji was one of the first that was written back in the day, and it is starting to show its age. Getting test coverage up to a reasonable state is a prerequisite for further refactoring (porting to python3, making it more modular, faster, etc.).
I'm moving! Today is my first day on the Release Engineering Development team (RED team) of the PnT DevOps organization at Red Hat. After I get my bearings, I'll be working on "Factory 2.0" which, while still quite a nebulous and undefined thing, boils down to focusing on the next-generation build and release pipeline for RHEL and other Red Hat products. What's cool about this is that, since it's future-facing work, I get to focus on how to knit the effort with what's been going on in Fedora releng. We'll have lots to talk about and hack on, I'm sure.