Just yesterday, we deployed the 3.0 version -- a rewrite -- of the fedora-packages webapp (source). For years now, it has suffered from data corruption problems that stemmed from multiple processes all fighting over resources stored on a gluster share between the app nodes. Gluster's not to blame. It's that too many things were trying to be "helpful" and no amount of locking would seem to solve the problem.
We have lots of old, open tickets about various kinds of data being missing from the webapp. Those are hopefully all resolved now. Please use it and file new tickets if you notice bugs. Patience appreciated.
Let's take a look at the internal architecture of the app. It's a cool idea. It doesn't really have any data of its own, but it is a layer on top of our other packaging apps; it just re-presents all of their data in one place. This is the "microservices" dream?
Here we have a diagram of the system as it was originally written in its "2.0" state.
HTTP requests come in to the app either for some initial page load or for some kind of subsequent ajax data. The app hands control off to one of two major subsystems -- a "widgets" controller that handles rendering all the tabs, and a "connectors" dispatcher that handles gathering and returning data. The widgets themselves actually re-use the connectors under the hood to prepare their initial data.
More complicated than that
First, there are only three widgets/connectors depicted above, but really there are many more (a search connector, a bugzilla connector, etc.). Some of them were written but never included anywhere in the app (in the latest pass through the code, I found an unused TorrentConnector which returned data about Fedora torrent downloads!).
Note that over the last few years, the widgets subsystem has remained largely unchanged. It is a source of technical debt, but it hasn't been the cause of any major breakages, so we haven't had cause to touch it. The widgets have metaclasses under the hood, can be nested into a hierarchy, and can declare js/css resource dependencies in a tree. It's pretty massive -- all on the server-side.
Lastly, there are (were) a variety of cronjobs (not depicted) which would update local data for a subset of the connectors. Notably, there was a yum-sync cronjob that would pull the latest yum repodata down to disk. There was a cronjob that would pull down all the latest koji builds "since the last time it ran". Another would crawl through the local yum repos and rebuild the search index based on what it thought was in rawhide.
Just... keep all that in mind.
Focus on the connectors
Here's a simpler drawing:
So, when it was first released, this beast was too slow. The koji connector would take forever to return, and bugzilla even longer.
To try and make things snappy, I added a cache layer internally, like this:
The "connector middleware" and the widget subsystem would both use the cache, and things became somewhat nicer! However, the cache expiry was too long, and people complained (rightly) that the data was often out of date. So, we reduced it and had the cache expire every 5 minutes. But that defeated the whole point. Every time you requested a page, the cache had almost certainly already expired and you'd have to wait and wait for the connectors to do their heavy lifting anyway.
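To make the trade-off concrete, here is a minimal sketch of a cache with a fixed expiry. It is not the app's actual cache layer; the class name, the `get_or_create` method, and the injectable `clock` are all assumptions made for illustration.

```python
import time


class ExpiringCache:
    """Tiny in-memory cache whose entries expire after a fixed number of seconds.

    Hypothetical stand-in for the real cache layer, for illustration only.
    """

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (value, time it was stored)

    def get_or_create(self, key, creator):
        entry = self._store.get(key)
        if entry is not None:
            value, stored_at = entry
            if self.clock() - stored_at < self.ttl_seconds:
                return value  # fresh hit: no connector call needed
        # Miss or expired entry: the requester waits on the slow connector.
        value = creator()
        self._store[key] = (value, self.clock())
        return value
```

With a 300 second expiry, a page that is visited less often than once every five minutes pays the full connector cost on every single visit, which is the problem described above.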
That's when (back in 2013), I got this idea to introduce an asynchronous cache worker, that looked something like this.
If you requested a page and the cache data was too old, the web app would just return the old data to you anyways, but it would also stick a note in a redis queue telling a cache worker daemon that it should rebuild that cache value for the next request.
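Roughly, that behaviour can be sketched like this. It is a simplified illustration, not the real code: a standard-library `queue.Queue` stands in for the redis queue, all the names are hypothetical, and building the value inline when nothing is cached yet is my assumption.

```python
import queue
import time


class StaleTolerantCache:
    """Serve old values immediately and ask a worker to rebuild them later."""

    def __init__(self, max_age, clock=time.monotonic):
        self.max_age = max_age
        self.clock = clock
        self.store = {}  # key -> (value, time it was stored)
        self.refresh_queue = queue.Queue()  # stand-in for the redis queue

    def get(self, key, creator):
        entry = self.store.get(key)
        if entry is None:
            # Assumption: with nothing cached at all, build the value inline.
            return self.refresh(key, creator)
        value, stored_at = entry
        if self.clock() - stored_at >= self.max_age:
            # Too old: leave a note for the worker, but do not wait for it.
            self.refresh_queue.put((key, creator))
        return value

    def refresh(self, key, creator):
        value = creator()
        self.store[key] = (value, self.clock())
        return value


def run_worker_once(cache):
    """Drain the queue, rebuilding every value that was asked for."""
    rebuilt = 0
    while True:
        try:
            key, creator = cache.refresh_queue.get_nowait()
        except queue.Empty:
            return rebuilt
        cache.refresh(key, creator)
        rebuilt += 1
```

The requester never blocks on a rebuild, so the first visitor after a long gap always sees whatever was cached last, however old it is.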
I thought it was pretty cool. You could request the page and sometimes get old data, but if you refreshed shortly after that you'd have the new stuff. Pages that were "hot" (being clicked on by multiple people) appeared to be kept fresh more regularly.
However, a page that was "cold" -- something that someone would visit once every few months -- would often present horribly old information to the requester. People frequently complained that the app was just out of sync entirely.
To make matters worse, it was out of sync entirely! We had a separate set of issues with the cron jobs (the one that would update the list of koji builds and the one that would update the yum cache). Sometimes, the webapp, the cache worker, and the cronjob would all try to modify the same files at the same time and horribly corrupt things. The cronjob would crash, and it would never go back to find the old builds that it failed to ingest. It was a mess.
The latest rewrite
Two really good decisions were made in the latest rewrite:
First, we dispensed entirely with the local yum repos (which were the resources most prone to corruption). We moved that out to an external network service called mdapi, which is very cool in its own right and makes the data story much simpler for the fedora-packages app.
Second, I replaced the reactive async cache worker with an active event-driven cache worker. Instead of updating the cache when a user requests the page, we update the cache when the resources change in the system we would query. For example, when someone does a new build in the buildsystem, the buildsystem publishes a message to our message bus. The cache worker receives that event -- it first deletes the old JSON data for the builds page for that package in the cache, and then it calls the KojiConnector with the appropriate arguments to re-fill that cache value with the latest data.
We turned off expiration in the cache altogether so that values never expire on their own. The outcome here is that the page data should be freshly cached before anyone requests it -- active cache invalidation.
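In code, the event-driven worker boils down to something like the sketch below. The topic string, the message layout, the cache key format, and the class name are illustrative assumptions, and a plain dict plays the role of the expiration-less cache.

```python
class BuildsCacheWorker:
    """Rebuild the cached builds data for a package when a build event arrives."""

    # Hypothetical topic name; the real bus topics may look different.
    BUILD_TOPIC = "buildsys.build.state.change"

    def __init__(self, cache, koji_connector):
        self.cache = cache  # dict-like store whose values never expire
        self.koji_connector = koji_connector  # callable: package name -> data

    def handle(self, message):
        if message.get("topic") != self.BUILD_TOPIC:
            return False  # not an event this worker cares about
        package = message["msg"]["name"]  # assumed message layout
        key = "builds:" + package
        # First delete the old data, then re-fill it from the connector.
        self.cache.pop(key, None)
        self.cache[key] = self.koji_connector(package)
        return True
```

The web app only ever reads from the cache, and the worker is the only writer, so a page request never triggers a connector call of its own.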
With those two changes, we were able to kill off all of the cronjobs.
Some additional complications: first, the cache worker also updates a local xapian database in response to events (in addition to the expiration-less cache), but it is the only process doing so and so can hopefully avoid further corruption issues.
Second, the bugzilla connector can't work like this yet because we don't yet have bugzilla events on our message bus. Zod-willing, we'll have them in January 2016 and we can flip that part on. The bugs tab will be slower than we like until then. UPDATE: We got bugzilla on the bus at the end of March, 2016.
We're building the fedora-hubs backend with the same kind of architecture (actively-invalidated cache of tough-to-assemble page data), so, we get to learn practical lessons here about what works and what doesn't.
Do hit us up in #fedora-apps on freenode if you want to help out, chat, or lurk. I'll be cleaning up any loose bugs on this deployment in the coming weeks while starting work on a new pdc-updater project.
All thanks to Abdel Martínez and Matej Stuchlik, we're going to be holding a (virtual) international "Fedora Activity Day" for Python 3 porting, and it is going to be amazing. Save the date -- November 14th and 15th.
Things to consider:
- If you haven't heard, 2016 is going to be the year of Python3 on the desktop, so...
- If you don't know what you're doing with Python3 porting, don't sweat it. If you want to learn, come join and we'll try to teach you along the way.
- If you don't know how to submit patches upstream, don't sweat it. If you want to learn, come join and we'll try to teach you along the way.
- If you want to hack with us, add your info to the wiki page. We'll be hanging out in an opentokrtc channel and in #fedora-python on freenode. See the details.
- We have a really cool webapp that Petr Viktorin put together. It tracks the status of different packages in Fedora and upstream so we can coordinate more effectively about what needs to be done.
- If you want to get people in your city together, that can make it more fun. You can join the video chat as a group! The EMEA crew will be online from the Pycon CZ 2015 sprints (cool). There are a couple of people from my local Python User Group who want to join in, although we're still searching for a reasonable place to meet up. I plan to be around starting at 18:00 UTC both days, although I bet the EMEA crew will be online much earlier.
Short post: I just discovered the --with-prof option to the nosetests command. It profiles your test suite using the hotshot module from the Python standard library, and it found a huge sore spot in my most frequently run suite. In this pull request we got the fedmsg_meta test suite running 31x faster by wall-clock time (4:33.53 down to 8.700 seconds).
(fedmsg_meta)❯ time $(which nosetests) -x
Ran 3822 tests in 270.822s
OK (SKIP=1638)
----------------------------------------------------------------------
Success!
$(which nosetests) -x 267.30s user 1.32s system 98% cpu 4:33.53 total

(fedmsg_meta)❯ time $(which nosetests) -x
Ran 3822 tests in 5.982s
OK (SKIP=1638)
----------------------------------------------------------------------
Success!
$(which nosetests) -x 3.87s user 0.71s system 52% cpu 8.700 total
That test suite used to take forever. It's the whole reason I wrote nose-audio in the first place!
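The hotshot module only exists in Python 2. For anyone on Python 3 who wants the same kind of report, here is a minimal sketch that swaps in cProfile and pstats from the standard library. This is not what the nose plugin does internally, and `slow_helper`, `run_suite`, and `profile_report` are made-up names standing in for a real test suite.

```python
import cProfile
import io
import pstats


def slow_helper():
    # Stand-in for an expensive function that the tests call repeatedly.
    return sum(i * i for i in range(20000))


def run_suite():
    # Stand-in for a test run that keeps hitting the same slow code path.
    return [slow_helper() for _ in range(20)]


def profile_report(func, limit=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


report = profile_report(run_suite)
print(report)
```

Sorting by cumulative time puts the functions that account for most of the run near the top, which is usually where a sore spot like this one shows up.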
I wrote last week about how a few of us from Fedora were at PyCon US in Montreal for the week. It's all over and done with and we're back home now (I got a flu-like bug for the second time this season on the way home...) So, these are just some quick notes on what I did at the sprints!
Early on, I ported python-fedora to python3 and afterwards bkabrda picked up the torch and ported fedmsg and the expansive fedmsg_meta module. The one thing standing in the way of full-on python3 fedmsg is the M2Crypto library which will probably not see python3 compatibility anytime soon. Slavek courageously ported half of fedmsg's crypto stack to the python3-compatible cryptography library only to find that it didn't support the other half of the equation. We're keeping those changes in a branch until that gets caught up.
The most exciting bit was helping Nolski with his tool that puts fedmsg notifications on the OSX desktop. It totally works. Crazy, right?
I started a prototype of fedora-hubs which doesn't do much but display little dummy widgets, but it is useful for reflecting on how the architecture ought to work.
I wrote some code to get the fedmsg-notify desktop tool to pull its preferences from the FMN service. The changes work, but they required some server-side patches to FMN that are done but haven't yet been rolled out to production (and we're in freeze for the Beta release anyway).
In order to use your FMN preferences, you currently have to set a gsettings value by hand which is unacceptable and gross, but I'm not sure how to present it in the config UI. We can't just go all-in with FMN because there are other distros out there (Debian) which use fedmsg-notify but which don't run their own FMN service. We'll have to think on it and let it sit for a while.
Lastly, Bodhi2 saw some good work. (We fixed some bugs that needed to be hammered out before release and we actually have an RPM and installed it on a cloud node! Staging will be coming next once some el7 compat deps get sorted out.)
I was really glad to meet sijis for fun-time late-night hackery in the hotel lobby.
That's all I can remember. It was a whirlwind, as always. Happy Hacking!
A few of us from Fedora are at PyCon US in Montreal for the week. The conference portion is almost over and the sprints start tomorrow, but in the meantime here are some highlights from the best sessions I sat in on:
- @nnja gave a great talk on technical debt and how it can contribute to a "culture of despair".
- @sigmavirus24's talk on writing tests against python-requests was supremely useful. Using his material, I wrote a patch for anitya that solved an onerous and recurring issue with the test suite.
- Raymond Hettinger gave a very nice talk on "moving beyond pep8" which was pretty relevant for my team and our code review practices. We write a lot of code, which entails doing a lot of code review. His thesis: working in a cosmetic pep8 mindset often causes you to miss the elephant in the room when doing code review. Instructive.
- There was a very good talk on one particular company's experiences with a microservices architecture. It is of special interest to me and our work on the Fedora Infrastructure team with lots of good take-aways. The video of it hasn't been posted yet, but definitely search for it in the coming days.
- I quite disagreed with some of the methods presented in the effective python session. No need for wrapper-class boilerplate -- just use itertools.tee(...)!
- Some others: distributed systems theory, interpreting your genome, systems stuff for non-systems people, and ansible were all very nice.
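To illustrate the itertools.tee remark above, here is a small sketch. I'm assuming the pattern in question was a container class that wraps a generator so it can be iterated over more than once; the `normalize` and `read_visits` names and the numbers are made up for the example.

```python
import itertools


def normalize(numbers):
    """Return each value as a percentage of the total.

    Accepts any iterable, including a one-shot generator.
    """
    # tee() hands back independent iterators over the same underlying data.
    for_total, for_output = itertools.tee(numbers, 2)
    total = sum(for_total)
    return [100 * value / total for value in for_output]


def read_visits():
    # A generator: it can only be consumed once.
    yield from (15, 35, 80)


percentages = normalize(read_visits())
print(percentages)
```

One caveat: because the first iterator is exhausted before the second one starts, tee has to buffer every item in memory, so for very large inputs this costs about as much as building a list up front.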
Some hacking happened in the interstitial periods!
- I wrote a prototype of a system to calculate and store statistics about fedmsg activity and some plugins for it. This will hopefully turn out to be quite useful for building community dashboards in the future (like a revamped releng dashboard or the nascent fedora-hubs).
- We ported python-fedora to python3! Hooray!
- The GSOC deadline really snuck up on us, so Pierre-Yves Chibon and I carved out some time to sit down and go over all the pending applications.
I'm really looking forward to the sprints and the chance to work and connect with all our upstreams. We'll be holding a "Live From Pycon" video cast at some point. Details forthcoming. Happy Hacking!