[three]Bean

statscache - thoughts on a new Fedora Infrastructure backend service

May 26, 2015 | categories: fedora, datagrepper, statscache

We've been working on a new backend service called statscache. It's not even close to done, but it has gotten to the point that it deserves an introduction.

A little preamble: the Fedora Infrastructure team runs ~40 web services. Community goings-on are channeled through IRC meetings (for which we have a bot), wiki pages (which we host), and more. Packages are built in our buildsystem. QA feedback goes through lots of channels, but particularly the updates system. There are lots of systems with many heterogeneous interfaces. About three years ago, we started linking them all together with a common message bus on the backend, and this has granted us a handful of advantages. One of them is that we now have a common history for all Fedora development activity (and, to a lesser extent, community activity).

There is a web interface to this common history called datagrepper. Here are some example queries to get a feel for it:
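
A sketch of a couple such queries, using the /raw endpoint and the same query parameters that the benchmarking script further down this page exercises:

import requests

url = 'https://apps.fedoraproject.org/datagrepper/raw/'

# Everything that crossed the bus in the last day (delta is in seconds).
recent = requests.get(url, params={'delta': 86400})

# Recent buildsystem activity for a single user.
builds = requests.get(url, params={'user': 'ralph', 'category': 'buildsys'})

print recent.json()
print builds.json()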

On top of this history API, we can build other things -- the first of which was the release engineering dashboard. It is a pure html/js app -- it has no backend server components of its own (no python) -- it directs your browser to make many different queries to the datagrepper API. It 1) asks for all the recent message types for X, Y, and Z categories, 2) locally filters out the irrelevant ones, and 3) tries to render only the latest events.

It is arguably useful. QA likes it because they can see what new things are available to test. In the future, perhaps the websites team can use it to get the latest AMIs for images uploaded to Amazon, so they can in turn update getfedora.org.

It is arguably slow. I mean, that thing really crawls when you try to load the page, and we've already put some tweaks in place to try to make it incrementally faster. We need a new architecture.

Enter statscache. The releng dash pulls raw data from the server to the browser and then computes some 'latest values' there to display. Why don't we compute and cache those latest values in a server-side service instead? This way they'll be ready and available for snappy delivery to web clients, and we won't have to stress out the master archive DB with all those queries trawling for gems.

@rtnpro and I have been working on it for the past few months and have a nice basis for the framework. It can currently cache some rudimentary stuff and most of the releng-dash information, but we have big plans. It is pluggable -- so if there's a new "thing you want to know about", you can write a statscache plugin for it, install it, and we'll start tracking those statistics over time. There are all sorts of metrics -- both the well-understood kind and the half-baked kind -- that we can track and have available for visualization.
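
To give a flavor of what a plugin might look like, here is a minimal, hypothetical sketch -- the class shape, method names, and topic glob below are my assumptions, not the actual statscache plugin API:

# Hypothetical sketch of a statscache plugin that caches the latest
# compose seen per branch.  The handle() interface is an assumption.
class LatestComposes(object):
    topic = 'org.fedoraproject.prod.compose.*'

    def __init__(self):
        self.latest = {}

    def handle(self, message):
        # Called once per matching fedmsg message; keep only the newest
        # timestamp for each branch.
        branch = message['msg'].get('branch')
        if branch:
            self.latest[branch] = message['timestamp']

The point is that each plugin maintains its own small, pre-computed cache, so clients never have to trawl the full history themselves.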

We can then plug those graphs in as widgets to the larger Fedora Hubs effort we're embarking on (visit the wiki page to learn about it). Imagine user profile pages there with nice d3.js graphs of personal and aggregate community activity. Something in the style of the calendar of contributions graph that GitHub puts on user profile pages would be a perfect fit (but for Fedora activity -- not GitHub activity).

Check out the code:

At this point we need:

  • New plugins of all kinds. What kinds of running stats/metrics would be interesting?
  • By writing plugins that flex the API of the framework, we want to find edge cases that cannot easily be coded. With those in hand, we can adjust the framework now -- early -- instead of six months from now when we have other code relying on this.
  • A set of example visualizations would be nice. I don't think statscache should host or serve the visualizations itself, but it will help to build a few toy ones in an examples/ directory to make sure the statscache API can be used sanely. We've been doing this with a statscache branch of the releng dash repo.
  • Unit/functional test cases. We have some, but could use more.
  • Stress testing. With a handful of plugins, how much does the backend suffer under load?
  • Plugin discovery. It would be nice to have an API endpoint we can query to find out what plugins are installed and active on the server (there's a sketch of what that might look like after this list).
  • Chrome around the web interface? It currently serves only JSON responses, but a nice little documentation page that will introduce a new user to the API would be good (kind of like datagrepper itself).
  • A deployment plan. We're pretty good at doing this now so it shouldn't be problematic.
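
On the plugin-discovery item above: a purely hypothetical sketch of what querying such an endpoint could look like. The URL, path, and response shape are all invented for illustration; no such endpoint exists yet.

import requests

# Hypothetical endpoint -- statscache does not actually have this yet.
resp = requests.get('https://apps.fedoraproject.org/statscache/api/plugins')

# An imagined response shape: one record per installed plugin.
for plugin in resp.json():
    print plugin['name'], plugin['active']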

PyCon 2015 (Part II)

Apr 21, 2015 | categories: python, fedora, pycon

I wrote last week about how a few of us from Fedora were at PyCon US in Montreal for the week. It's all over and done with and we're back home now (I got a flu-like bug for the second time this season on the way home...). So, these are just some quick notes on what I did at the sprints!

  • Early on, I ported python-fedora to python3, and afterwards bkabrda picked up the torch and ported fedmsg and the expansive fedmsg_meta module. The one thing standing in the way of full-on python3 fedmsg is the M2Crypto library, which will probably not see python3 compatibility anytime soon. Slavek courageously ported half of fedmsg's crypto stack to the python3-compatible cryptography library, only to find that it didn't support the other half of the equation. We're keeping those changes in a branch until that gets caught up.

  • The most exciting bit was helping Nolski with his tool that puts fedmsg notifications on the OS X desktop. It totally works. Crazy, right?

  • Later in the week, I helped decause a bit with his new cardsite app. Load it up and let it run for a while. It's yet another neat way to visualize the activity of the Fedora community in realtime.

  • I started a prototype of fedora-hubs. It doesn't do much beyond displaying little dummy widgets yet, but it is useful for reflecting on how the architecture ought to work.

  • I wrote some code to get the fedmsg-notify desktop tool to pull its preferences from the FMN service. The changes work, but they required some server-side patches to FMN that are done, but haven't yet been rolled out to production (and we're in freeze for the Beta release anyways..).

    In order to use your FMN preferences, you currently have to set a gsettings value by hand, which is unacceptable and gross, but I'm not sure how to present it in the config UI. We can't just go all-in with FMN, because there are other distros out there (Debian) that use fedmsg-notify but don't run their own FMN service. We'll have to think on it and let it sit for a while.

  • Lastly, Bodhi2 saw some good work. We fixed some bugs that needed to be hammered out before release, and we actually have an RPM and installed it on a cloud node! Staging will be coming next, once some el7 compat deps get sorted out.

  • ncoghlan introduced us to the authors of kallithea, which led to some conversations about pagure and where we can collaborate.

  • I was really glad to meet sijis for fun-time late-night hackery in the hotel lobby.

That's all I can remember. It was a whirlwind, as always. Happy Hacking!


PyCon 2015 (Part I)

Apr 12, 2015 | categories: python, fedora, pycon

A few of us from Fedora are at PyCon US in Montreal for the week. The conference portion is almost over and the sprints start tomorrow, but in the meantime here are some highlights from the best sessions I sat in on:

Some hacking happened in the interstitial periods!

  • I wrote a prototype of a system to calculate and store statistics about fedmsg activity and some plugins for it. This will hopefully turn out to be quite useful for building community dashboards in the future (like a revamped releng dashboard or the nascent fedora-hubs).
  • We ported python-fedora to python3! Hooray!
  • The GSoC deadline really snuck up on us, so Pierre-Yves Chibon and I carved out some time to sit down and go over all the pending applications.

I'm really looking forward to the sprints and the chance to work and connect with all our upstreams. We'll be holding a "Live From Pycon" video cast at some point. Details forthcoming. Happy Hacking!


Karma Cookies, and how to give them

Mar 19, 2015 | categories: fedmsg, fedora, badges

It took a while to get all the ingredients together, but we baked up a delicious batch of new Fedora Badges and they're fresh out of the oven.

To quote mizmo from the original ticket:

So here's the idea. The FPL has the FPL blessing, right? But, someone did
something really helpful and awesome for me today, with no expectation of
getting anything in return. This person really made my life easier. And I
really wish I could give him something as a token of my appreciation.

So my thought here - maybe everyone in the Fedora project gets an amount of
cookie badges that they can hand out as thank yous to others in the project as
they're getting things done and helping each other out. You can't award one to
yourself, only others. Maybe you get one cookie for every 5 badges you have
earned, so folks get a number of cookies proportional to their achievements in
the system, and if they run out they can replenish their cookies by earning
more badges.

The excellent riecatnor got to work and whipped up some treats:

https://badges.fedoraproject.org/pngs/macaroncookie.png

The last missing piece was a new plugin for zodbot that listens for USERNAME++ in IRC and publishes a new karma message to Fedora Infrastructure's message bus.

That's all done and in place now. You can grant someone karma like this:

| threebean │ riecatnor++

And check to see how much karma a given user has like this:

| threebean │ .karma riecatnor
|    zodbot │ threebean: Karma for riecatnor has been increased 1 times

Lastly, there are a handful of restrictions on it. You can only give karma to a particular individual once (although you have an unlimited supply of points to give to the entire Fedora community):

| threebean │ riecatnor++
|    zodbot │ threebean: You have already given 1 karma to riecatnor

You can't give yourself karma:

| threebean │ threebean++
|    zodbot │ threebean: You may not modify your own karma.

You can only give karma to FAS users, and you can only give karma if you are a FAS user (your IRC nick must match your FAS username, or you must have your IRC nick listed in FAS).

There's code in the plugin to allow negative karma, i.e. threebean--, but we have that disabled. Best to stay positive! ;)
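
For the curious, the core of such a plugin boils down to a regex match plus the checks above. The following is just an illustrative sketch in that spirit -- it is not the actual zodbot/supybot plugin code:

import re

# Matches messages like 'riecatnor++' and captures the nick.
KARMA_PATTERN = re.compile(r'(\w+)\+\+$')

already_given = set()  # (giver, recipient) pairs already awarded


def handle_karma(giver, text, fas_users):
    """ Return zodbot's reply to a karma message, or None to stay quiet. """
    match = KARMA_PATTERN.match(text.strip())
    if not match:
        return None
    recipient = match.group(1)
    if recipient == giver:
        return '%s: You may not modify your own karma.' % giver
    if giver not in fas_users or recipient not in fas_users:
        return None  # both parties must be FAS users
    if (giver, recipient) in already_given:
        return '%s: You have already given 1 karma to %s' % (giver, recipient)
    already_given.add((giver, recipient))
    # ... here the real plugin publishes a karma message to the bus ...
    return '%s: Karma for %s has been increased 1 times' % (giver, recipient)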

Enjoy! As always, if you have questions about this stuff, please do jump into #fedora-apps on freenode and ask away!

http://i.imgur.com/vK7Cl.gif

Revisiting Datagrepper Performance

Feb 27, 2015 | categories: fedmsg, datanommer, fedora, datagrepper, postgres

In Fedora Infrastructure, we run a service somewhat-hilariously called datagrepper which lets you make queries over HTTP about the history of our message bus. (The service that feeds the database is called datanommer.) We recently crossed the mark of 20 million messages in the store, and the thing still works, but it has become noticeably slower over time. This affects other dependent services:

  • The releng dashboard and others make HTTP queries to datagrepper.
  • The fedora-packages app waits on datagrepper results to present brief histories of packages.
  • The Fedora Badges backend queries the db directly to figure out if it should award badges or not.
  • The notifications frontend queries the db to try to display which messages in the past would have matched a hypothetical set of rules.

I've written about this chokepoint before, but haven't had time to really do anything about it... until this week!

Measuring how bad it is

First, some stats -- I wrote this benchmarking script to try a handful of different queries on the service and report some average response times:

#!/usr/bin/env python
import requests
import itertools
import time
import sys

url = 'https://apps.fedoraproject.org/datagrepper/raw/'

attempts = 8

possible_arguments = [
    ('delta', 86400),
    ('user', 'ralph'),
    ('category', 'buildsys'),
    ('topic', 'org.fedoraproject.prod.buildsys.build.state.change'),
    ('not_package', 'bugwarrior'),
]

result_map = {}
for left, right in itertools.product(possible_arguments, possible_arguments):
    if left is right:
        continue
    # (left, right) and (right, left) describe the same query, so dedupe
    # them by hashing the sorted, combined parameters.
    key = hash(str(list(sorted(set(left + right)))))
    if key in result_map:
        continue

    results = []
    params = dict([left, right])
    for attempt in range(attempts):
        start = time.time()
        r = requests.get(url, params=params)
        assert(r.status_code == 200)
        results.append(time.time() - start)

    # Throw away the two smallest and the two largest results (outliers)
    results.remove(min(results))
    results.remove(min(results))
    results.remove(max(results))
    results.remove(max(results))

    average = sum(results) / len(results)
    result_map[key] = average

    print "%0.4f    %r" % (average, str(params))
    sys.stdout.flush()

The results get printed out in two columns.

  • The leftmost column is the average number of seconds it takes to make a query (we try 8 times, throw away the two shortest and the two longest, and take the average of the remaining four).
  • The rightmost column is a description of the query arguments passed to datagrepper. Different kinds of queries take different times.

This first set of results are from our production instance as-is:

7.7467    "{'user': 'ralph', 'delta': 86400}"
0.6984    "{'category': 'buildsys', 'delta': 86400}"
0.7801    "{'topic': 'org.fedoraproject.prod.buildsys.build.state.change', 'delta': 86400}"
6.0842    "{'not_package': 'bugwarrior', 'delta': 86400}"
7.9572    "{'category': 'buildsys', 'user': 'ralph'}"
7.2941    "{'topic': 'org.fedoraproject.prod.buildsys.build.state.change', 'user': 'ralph'}"
11.751    "{'user': 'ralph', 'not_package': 'bugwarrior'}"
34.402    "{'category': 'buildsys', 'topic': 'org.fedoraproject.prod.buildsys.build.state.change'}"
36.377    "{'category': 'buildsys', 'not_package': 'bugwarrior'}"
44.536    "{'topic': 'org.fedoraproject.prod.buildsys.build.state.change', 'not_package': 'bugwarrior'}"

Notice that a handful of queries are under one second but some are unbearably long. A seven second response time is too long, and a 44-second response time is way too long.

Setting up a dev instance

I grabbed the dump of our production database and imported it into a fresh postgres instance in our private cloud to mess around with. Before making any further modifications, I ran the benchmarking script again on this new guy and got some different results:

5.4305    "{'user': 'ralph', 'delta': 86400}"
0.5391    "{'category': 'buildsys', 'delta': 86400}"
0.4992    "{'topic': 'org.fedoraproject.prod.buildsys.build.state.change', 'delta': 86400}"
4.5578    "{'not_package': 'bugwarrior', 'delta': 86400}"
6.4852    "{'category': 'buildsys', 'user': 'ralph'}"
6.3851    "{'topic': 'org.fedoraproject.prod.buildsys.build.state.change', 'user': 'ralph'}"
10.932    "{'user': 'ralph', 'not_package': 'bugwarrior'}"
9.1895    "{'category': 'buildsys', 'topic': 'org.fedoraproject.prod.buildsys.build.state.change'}"
14.950    "{'category': 'buildsys', 'not_package': 'bugwarrior'}"
12.044    "{'topic': 'org.fedoraproject.prod.buildsys.build.state.change', 'not_package': 'bugwarrior'}"

A couple things are faster here:

  • No ssl on the HTTP requests (almost irrelevant)
  • No other load on the db from other live requests (likely irrelevant)
  • The db was freshly imported (the last time we moved the db server, things got magically faster; I think there's something about the way postgres stores data internally such that a fresh import organizes it more effectively, though I have no data or real know-how to support this claim).

Experimenting with indexes

I first tried adding indexes on the category and topic columns of the messages table (which are common columns used for filter operations). We already have an index on the timestamp column, without which the whole service is just unusable.
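
Concretely, the experiment amounted to something like the following sketch -- the index names are mine, and the connection string is a placeholder for the dev instance described above:

import psycopg2

# Placeholder DSN -- point this at the dev database.
conn = psycopg2.connect('dbname=datanommer')
cur = conn.cursor()

# Simple single-column indexes on the common filter columns.
cur.execute('CREATE INDEX messages_category_idx ON messages (category)')
cur.execute('CREATE INDEX messages_topic_idx ON messages (topic)')
conn.commit()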

Some results after adding those:

0.1957    "{'user': 'ralph', 'delta': 86400}"
0.1966    "{'category': 'buildsys', 'delta': 86400}"
0.1936    "{'topic': 'org.fedoraproject.prod.buildsys.build.state.change', 'delta': 86400}"
0.1986    "{'not_package': 'bugwarrior', 'delta': 86400}"
6.6809    "{'category': 'buildsys', 'user': 'ralph'}"
6.4602    "{'topic': 'org.fedoraproject.prod.buildsys.build.state.change', 'user': 'ralph'}"
10.982    "{'user': 'ralph', 'not_package': 'bugwarrior'}"
3.7270    "{'category': 'buildsys', 'topic': 'org.fedoraproject.prod.buildsys.build.state.change'}"
14.906    "{'category': 'buildsys', 'not_package': 'bugwarrior'}"
7.6618    "{'topic': 'org.fedoraproject.prod.buildsys.build.state.change', 'not_package': 'bugwarrior'}"

Response times are faster in the cases you would expect.

Those columns are relatively simple one-to-many relationships. A message has one topic, and one category. Topics and categories are each associated with many messages. There is no JOIN required.

Handling the many-to-many cases

Speeding up the queries that require filtering on users and packages is more tricky. They are many-to-many relations -- each user is associated with multiple messages and a message may be associated with many users (or many packages).

I did some research, and through trial and error found that adding a composite primary key on the bridge tables gave a nice performance boost.
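
In SQL terms, the change looks roughly like this sketch -- the bridge table and column names are my assumptions about the schema, not verified against datanommer:

import psycopg2

conn = psycopg2.connect('dbname=datanommer')  # placeholder DSN
cur = conn.cursor()

# A composite primary key on each many-to-many bridge table doubles as
# an index covering both sides of the relation.
cur.execute('ALTER TABLE user_messages ADD PRIMARY KEY (username, msg)')
cur.execute('ALTER TABLE package_messages ADD PRIMARY KEY (package, msg)')
conn.commit()

See the results here: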

0.2074    "{'user': 'ralph', 'delta': 86400}"
0.2091    "{'category': 'buildsys', 'delta': 86400}"
0.2099    "{'topic': 'org.fedoraproject.prod.buildsys.build.state.change', 'delta': 86400}"
0.2056    "{'not_package': 'bugwarrior', 'delta': 86400}"
1.4863    "{'category': 'buildsys', 'user': 'ralph'}"
1.4553    "{'topic': 'org.fedoraproject.prod.buildsys.build.state.change', 'user': 'ralph'}"
1.8186    "{'user': 'ralph', 'not_package': 'bugwarrior'}"
3.5525    "{'category': 'buildsys', 'topic': 'org.fedoraproject.prod.buildsys.build.state.change'}"
10.9242    "{'category': 'buildsys', 'not_package': 'bugwarrior'}"
3.5214    "{'topic': 'org.fedoraproject.prod.buildsys.build.state.change', 'not_package': 'bugwarrior'}"

The best so far! That one 10.9 second query is undesirable, but it also makes sense: we're asking it to first filter for all buildsys messages (the spammiest category) and then to prune those down to only the builds (a proper subset of that category). If you query just for the builds by topic and omit the category part (which is what you want anyways) the query takes 3.5s.

All around, I see a 3.5x speed increase.

Rolling it out

The code is set to be merged into datanommer and I wrote an ansible playbook to orchestrate pushing the change out. I'd push it out now, but we just entered the infrastructure freeze for the Fedora 22 Alpha release. Once we're through that and all thawed, we should be good to go.

