Wednesday 30 March 2016

Pipeline 2016

A write up of my notes: they may or may not make any sense.

Keynote: Jez Humble "What I Learned From Three Years Of Sciencing The Cr*p Out Of Continuous Delivery" or "All about SCIENCE"

Surveys

Surveys are instruments for measuring latent constructs, such as feelings, that cannot be observed directly - see psychometrics.
Surveys need a hypothesis to test and should be worded carefully.
Consider discriminant and convergent validity.
Test for false positives.

Consider the Westrum typology.
With 6 axes (rows) scaled across three columns: pathological, bureaucratic, generative you can start spotting connections.

Pathological                  | Bureaucratic              | Generative
Power Oriented                | Rule Oriented             | Performance Oriented
Low cooperation               | Modest cooperation        | High cooperation
Messengers shot               | Messengers neglected      | Messengers trained
Responsibilities shirked      | Narrow responsibilities   | Risks are shared
Bridging discouraged          | Bridging tolerated        | Bridging encouraged
Failure leads to scapegoating | Failure leads to justice  | Failure leads to inquiry
Novelty crushed               | Novelty leads to problems | Novelty implemented

For example "Failure leads to" has three different options: scapegoating, justice or inquiry. Where does your org come out for each question? If they say "It's all Matt's fault" and sack Matt that won't avoid mistakes happening again. Blameless postmortems are important.
IT and aviation are both high-tempo, high consequence environments. They are adaptive complex systems: there is frequently not enough information to make a decision. Therefore reduce the consequences of things going wrong.
In general for surveys, use a Likert type scale - use clearly worded statements on a scale, allowing numerical analysis. See if your questions "load together" (or bucket). Maybe spotting what's gone wrong with some software buckets into notification from outside (customers etc) and notification from inside (alerts etc).
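To make "load together" a little more concrete: a rough first check is whether the answers to two statements rise and fall together across respondents. The sketch below computes a Pearson correlation between two Likert items coded 1 to 5. The function name and the coding are my own assumptions for illustration; a proper analysis would use factor analysis rather than pairwise correlations.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Pearson correlation between two survey items, one score per respondent.
// Scores are assumed to be Likert responses coded 1 (strongly disagree)
// to 5 (strongly agree). The name item_correlation is hypothetical.
double item_correlation(const std::vector<int>& a, const std::vector<int>& b)
{
    if (a.size() != b.size() || a.size() < 2)
        throw std::invalid_argument("need two items answered by the same respondents");

    const double n = static_cast<double>(a.size());
    double mean_a = 0.0;
    double mean_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        mean_a += a[i];
        mean_b += b[i];
    }
    mean_a /= n;
    mean_b /= n;

    double cov = 0.0;
    double var_a = 0.0;
    double var_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double da = a[i] - mean_a;
        const double db = b[i] - mean_b;
        cov += da * db;
        var_a += da * da;
        var_b += db * db;
    }
    if (var_a == 0.0 || var_b == 0.0)
        throw std::invalid_argument("an item where everyone answered the same has no correlation");

    return cov / std::sqrt(var_a * var_b);
}
```

Items that belong in the same bucket should correlate strongly with each other and weakly with items from other buckets.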
Consider CMV, CMB - common method variance or bias. Look for early versus late respondents.
See https://puppetlabs.com/2015-devops-report for the previous devops survey.
In fact take this year's https://puppetlabs.com/blog/2016-state-devops-survey-here

IT performance

How do you measure it? How do you predict it? It seems that "I am satisfied with my job" is the biggest predictor of organisational performance.
Does your company have a culture of "autonomy, mastery, purpose"? What motivates us? [See Pink]

How do we measure IT performance? Consider lead time, release frequency, time to restore, change failure rate...
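A minimal sketch of how those four measures might be computed from a list of production changes. The struct, its field names and the choice of averaging are my assumptions, not something shown in the talk.

```cpp
#include <cstddef>
#include <vector>

// One change released to production. Field names are illustrative only.
struct Change {
    double lead_time_hours;   // from commit to running in production
    bool   failed;            // did the change cause a failure in production?
    double restore_minutes;   // time to restore service; only used if failed
};

struct Summary {
    double mean_lead_time_hours;
    double releases_per_day;
    double mean_restore_minutes;   // averaged over failed changes only
    double change_failure_rate;    // fraction of changes that failed, 0 to 1
};

// Summarise the changes seen over a period of days_observed days.
// Returns all zeros if there is nothing to summarise.
Summary summarise(const std::vector<Change>& changes, double days_observed)
{
    Summary s{0.0, 0.0, 0.0, 0.0};
    if (changes.empty() || days_observed <= 0.0)
        return s;

    std::size_t failures = 0;
    double lead_total = 0.0;
    double restore_total = 0.0;
    for (const Change& c : changes) {
        lead_total += c.lead_time_hours;
        if (c.failed) {
            ++failures;
            restore_total += c.restore_minutes;
        }
    }

    const double n = static_cast<double>(changes.size());
    s.mean_lead_time_hours = lead_total / n;
    s.releases_per_day = n / days_observed;
    s.mean_restore_minutes = failures > 0 ? restore_total / static_cast<double>(failures) : 0.0;
    s.change_failure_rate = static_cast<double>(failures) / n;
    return s;
}
```

Tracking all four together is the point: the first two describe throughput and the last two describe stability, so you can check with your own data whether going faster costs you stability.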
Going faster doesn't mean you break things, it actually makes you *more* stable, if you look at the data [citation needed]
"Bi-modal IT" is wrong: watch out for Jez's upcoming blog about "fast doesn't compromise safety"

Do we still want to work in the dark ages of manual config and no test automation?

We claim we are doing continuous integration (CI) by redefining CI. Do devs merge to trunk daily? Do you have tests? Do you fix the build if it goes red?

Aside: "Surveys are a powerful source of confirmation bias"

Question: Can we work together when things go wrong?

Do you have peer reviewed changes? (Mind you, change advisory boards)

Science again (well, stats)

SEM: structural equation modelling: use this to avoid spurious correlations.

Apparently 25% of people do TDD - it's the lost XP practice. TDD forces you to write code in testable ways: it's not about the tests.

How good are your tests? Consider mutation testing e.g. Ivan Moore's Jester
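The idea behind mutation testing, sketched by hand: make a small change (a "mutant") to the code and see whether any test notices. Mutation tools generate and run the mutants for you; here the mutant is written out explicitly. All the names below are made up for illustration.

```cpp
#include <functional>

using Predicate = std::function<bool(int)>;

// The code under test.
bool is_adult(int age) { return age >= 18; }

// A mutant: the same code with >= changed to >.
bool is_adult_mutant(int age) { return age > 18; }

// A weak test suite: it never checks the boundary value.
bool weak_suite_passes(const Predicate& candidate)
{
    return !candidate(10) && candidate(30);
}

// A stronger suite that also checks the boundary at 18.
bool strong_suite_passes(const Predicate& candidate)
{
    return weak_suite_passes(candidate) && candidate(18);
}
```

The weak suite passes for both the original and the mutant, so the mutant "survives" and the suite has a gap. The strong suite passes for the original and fails for the mutant, so the mutant is "killed".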

Change advisory boards don't work. They obviously impact throughput but have negligible impact on stability. Jez suggested the phrase "Risk management theatre".


Ian Watson and Chris Covell "Steps closer to awesome"

They work at Call Credit (used to be part of the Skipton building soc) and talked about how to change an organisation.

Their hypothesis: "You already have the people you need."
"Metal as a service" sneaked a mention, since some people were playing buzz-word bingo.
Question: what would make this org "nirvana"?
They started broadcasting good (and bad) things to change the culture. e.g. moving away from a fear of failure. Having shared objectives helped.

We are people, not resources. "Matrix management" (cue obvious slides) - not a good thing. Be the "A" team instead. (Or the Goonies).

The environment matters. They suggested blowing up a red balloon each time you are interrupted for 15 seconds or more, giving a visual aid of the distractions.

They mentioned "Death to manual deployments" being worth reading.

They said devs should never have access to prod.
You need centres of excellence: peer pressure helps.
They have new bottlenecks: "two speed IT" .... the security team should be enablers not the police.
They mentioned the "improvement kata"
They said you need your ducks in a straight line == a backlog of good stories.

Gary Frost "Financial Institutions Carry Too Much Risk, It’s Time To Embrace Continuous Delivery"

of 51zero.com
Sarbanes-Oxley (SOx) was introduced because of risk in finance. Has it worked? No.
It brought about a segregation of duties, lots of change control review and "runbooks". This is still high risk. There have been lots of breaches from IT departments e.g. Knight Capital, NatWest (3 times).
Why are we still failing, despite these "safety measures"?
We need fully automated testing including security and performance. We need micro-services (and containers), giving us isolation.
Aside: architecture diagrams...! Are they helpful? Are they even correct? Why not automatically generate these too so they are at least correct?

What are the blockers? Silos. Move to collaborative environments.

Look out for new FinTech disruption (start-ups I presume)

Gustavo Elias "How To Deal With A Hot Potato"

He was landed with legacy code that was deeply flawed, had multiple responsibilities and high maintenance costs. In fact he calculated these costs and told management. For example, with downtime for deployment and 40 minutes to restart, he calculated the cost at over £500 per day per dev.
How to change this?
  • Re-architect
  • Reach zero downtime
  • Detach from the old release cycle
How?
Re-architect with micro-services and the strangler (strangle-vine) pattern.
Reach zero downtime with a canary release and blue/green deployment. You need business onside for the extra hardware.
Old release cycle: bamboo plan - but this needs new machines.
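A canary release, as mentioned above, sends a small share of traffic to the new version while everyone else stays on the old one. A minimal sketch of the routing decision follows, assuming numeric user ids and a percentage you can dial up; the names are hypothetical and this is not how Gustavo said he implemented it.

```cpp
// Which version should serve a request?
enum class Target { Stable, Canary };

// Route a fixed share of users to the canary.
// The same user always lands in the same bucket, so each user sees a
// consistent version while canary_percent stays the same.
Target route(unsigned long user_id, unsigned int canary_percent)
{
    const unsigned int bucket = static_cast<unsigned int>(user_id % 100);
    return bucket < canary_percent ? Target::Canary : Target::Stable;
}
```

Start with a small percentage, watch the monitoring, then raise it. Setting it to 0 sends everyone back to the stable version, and 100 completes the switch.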
In the end, be proud.

Pete Marshall "Achieving Continuous Delivery In A Legacy Environment"

The tech architect at Planday (a shift work app)
His topic: C.D. in a legacy environment, where C.D. means continuous delivery and not "chaotic delivery".
Ask the question: "What are your business goals?"
They had DNS load balancing, "interesting stand-ups" (nobody cared), no monitoring.
He started a tech radar: goals to get people on board.
He used a corp screensaver to communicate the pipeline vision.
How easy is your code to build? Do you know what's actually in prod? Can you find the delta?
He changed NAnt to MSBuild.
He became a test mentor, having half hour sessions to increase test coverage.
They had estimation sessions and planning sessions.
Teams started to release on their own schedule with minimal disruption to others. 
Logging, monitoring and alerting helped: look for patterns in the logs. n.b. loggly (though cloud based with no instance in Europe so might be slow)
He mentioned feature toggles (I wondered how he implemented these: please not boolean flags in a database, but enough of my pain), though watch out - you can still get surprises.
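For anyone who has not met feature toggles, here is a minimal in-memory sketch. It is my own illustration, not Pete's implementation (which I don't know), and the feature name is made up.

```cpp
#include <map>
#include <string>

// A tiny in-memory toggle registry. Where the values come from
// (config file, environment, service) is deliberately left out.
class FeatureToggles {
public:
    void set(const std::string& name, bool enabled) { toggles_[name] = enabled; }

    // Unknown features are off, so a typo in a name fails safe.
    bool is_enabled(const std::string& name) const
    {
        const auto it = toggles_.find(name);
        return it != toggles_.end() && it->second;
    }

private:
    std::map<std::string, bool> toggles_;
};

// Calling code branches on the toggle, keeping the old path available.
// "new-rota-view" is a hypothetical feature name.
std::string rota_view(const FeatureToggles& toggles)
{
    return toggles.is_enabled("new-rota-view") ? "new view" : "old view";
}
```

The surprises tend to come from combinations: every toggle doubles the number of paths through the code, so remove toggles once a feature has settled.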
He used the strangler pattern.
Don't do loads of things: do a couple of things you can actually measure.
Ask yourself "What's the risk of failure?"

Sally Goble "What do you do if you don't do testing?"

From QA at The Guardian
They previously had a two-week release cycle, with a staging environment and lots of manual testing.
They deployed at 8am on a Wednesday. A big news day delayed the release cycle by a week. 
They couldn't roll back.
They moved to automated tests - perhaps selenium. They were mainly comparing pixels.
Then they threw them out.
So, what does QA do if it doesn't do testing? They now make sure they are "not wrong long." i.e. they can fix things quickly.
They have feature switching, canary releases and monitoring (but avoid noise).
They are not a testing department but a quality department. They can concentrate on other things - like less data so apps don't blow out users' data plans or similar.

Steve Elliott "Measure everything, not just production"

Laterooms: something about badgers.
Tools: log aggregation: Elastic stack. Metrics: Kibana, Grafana. Alerting: Icinga(2) [like Nagios only prettier]
Previously dev/test was slow, had no investment. They had flaky tests and it was difficult to spot trends.
They moved to instrumentation and tooling in dev.
"Measure ALL the things"
Be aware that dashboard fatigue is a thing.
He pointed us at github
Have lots of metrics but don't use them to be Orwellian. Have data-driven retrospectives. (I once made a graph of who was asking who for code review to reveal cliques in our team - data makes a difference! And pictures more so.) He mentioned that you need to make space for feelings in the retrospectives too.
He suggested mixing up the format to keep retrospectives fresh: consider using http://plans-for-retrospectives.com/index.html

He said he was running sentiment analysis on the tweets he got during his talk. 

He mentioned that Devops Manchester is always looking for speakers.

Summary

I'm so glad I went. It's useful to see people talking about their successes (and failures) and to reflect on common themes. "People not resources" struck a deep note for me. I am always inspired when I see people trying to make things better, no matter how hard.
I loved the brief mention of stats in the keynote. The main themes were, of course, about measuring and automating. I will spend time thinking about what else I can measure and how to do stats and present them to non-statisticians in a clear way.
Never under-estimate the power of saying "Prove it" when someone makes a claim.




Saturday 12 March 2016

Random Magic

Have you ever written a unit test with magic numbers in and felt bad? For example, given a C++ class that simulates stock prices, Simulation, you would expect a starting price of zero to stay at zero. Let’s write a test for this using Catch:

#define CATCH_CONFIG_MAIN // ask Catch to supply main()
#include "catch.hpp"
// Also include the header that declares Simulation (not shown in this post).

TEST_CASE("simulation starting at 0 remains at 0", "[Property]")
{
    const double start_price = 0.0;
    const double drift       = 0.3;//or whatever
    const double volatility  = 0.2;//or whatever
    const double dt          = 0.1;//or whatever
    const unsigned int seed  = 1;  //or whatever
    Simulation price(start_price, drift, volatility, dt, seed);
    REQUIRE(price.update() == 0.0);
}

Oh dear; magic numbers. That sinking feeling when you don’t know or care what values some variables take. The comments hint at the unhappiness. You could write a few more test cases with other numbers, or use a parameterised approach. Trying every possible double or int would be extreme, and make the unit tests slow. Unit tests should be fast, so we’d best not. We could try some random variables instead of the magic numbers. This might lead to cases that sometimes fail, and unit tests should provide repeatable results, so we’d best not.

Oh dear. If only we had some random magic to help. We need something that allows us to test that properties hold for a variety of cases. We don’t want to hand roll lots of ad-hoc test cases ourselves. If we generate random test cases we need the results to be clearly reported so we know what went wrong if something fails. We need property-based testing. Good news! Haskell got there long before us. 

QuickCheck “is a tool for testing Haskell programs automatically. The programmer provides a specification of the program, in the form of properties which functions should satisfy, and QuickCheck then tests that the properties hold in a large number of randomly generated cases.” [See the manual] You define a property, such as reversing a reversed list gives the original list:

prop_RevRev xs = reverse (reverse xs) == xs
          where types = xs::[Int]

Then quickly check it holds for some randomly generated examples.


        Main> quickCheck prop_RevRev
        OK, passed 100 tests.

If a property doesn’t hold, quickCheck reports the case or “counter-example” for which it does not hold. Instead of my initial “example-based” test I can now test my property holds generally. Since the cases are randomly generated rather than exhaustive I may still miss problems, but look how much shorter the code was.

Wait a moment! I was trying to test some C++ and got distracted by Haskell. The good news is ports of QuickCheck exist for various languages. For example, F# has FsCheck, Python has Hypothesis and C++, being C++, has various versions. I have tried Legiasoft’s QuickCheck and showed my initial attempts at the #ACCU2015 conference.

A recent blog from Spotify drew my attention to RapidCheck. This claims to integrate with Boost.Test and Google Test/Mock, though I haven't tried it yet. I wonder if I can make it play nicely with Catch. I will report back. Another interesting feature it supports is stateful testing, based on Erlang’s port of QuickCheck. Since this started with Haskell, many frameworks need *pure* functions. Once in a while, some of us are not quite as pure as we'd like, so I can imagine this being very useful.
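To show the idea without committing to any of these libraries, here is a hand-rolled sketch of a property check for the "zero stays at zero" property, using only the standard library. The Simulation class below is a stand-in I have assumed (one step of geometric Brownian motion), since the real class isn't shown, and the parameter ranges are arbitrary choices. The generator is seeded, so a failing run can be repeated, and the first counter-example is reported.

```cpp
#include <cmath>
#include <random>
#include <sstream>
#include <string>

// Stand-in for the Simulation class: an assumption for this sketch,
// taking one step of geometric Brownian motion per update.
class Simulation {
public:
    Simulation(double start_price, double drift, double volatility,
               double dt, unsigned int seed)
        : price_(start_price), drift_(drift), volatility_(volatility),
          dt_(dt), engine_(seed) {}

    double update()
    {
        std::normal_distribution<double> normal(0.0, 1.0);
        const double z = normal(engine_);
        price_ *= std::exp((drift_ - 0.5 * volatility_ * volatility_) * dt_
                           + volatility_ * std::sqrt(dt_) * z);
        return price_;
    }

private:
    double price_;
    double drift_;
    double volatility_;
    double dt_;
    std::mt19937 engine_;
};

// Property: a price starting at zero stays at zero, whatever the other
// parameters are. Returns an empty string if every generated case passes,
// otherwise a description of the first counter-example.
std::string check_zero_stays_zero(unsigned int generator_seed, int cases = 100)
{
    std::mt19937 gen(generator_seed);   // fixed seed gives repeatable runs
    std::uniform_real_distribution<double> any_drift(-1.0, 1.0);
    std::uniform_real_distribution<double> any_volatility(0.0, 1.0);
    std::uniform_real_distribution<double> any_dt(0.001, 1.0);

    for (int i = 0; i < cases; ++i) {
        const double drift = any_drift(gen);
        const double volatility = any_volatility(gen);
        const double dt = any_dt(gen);
        const unsigned int seed = static_cast<unsigned int>(gen());

        Simulation price(0.0, drift, volatility, dt, seed);
        if (price.update() != 0.0) {
            std::ostringstream report;
            report << "counter-example: drift=" << drift
                   << " volatility=" << volatility
                   << " dt=" << dt << " seed=" << seed;
            return report.str();
        }
    }
    return "";
}
```

A real library adds a lot on top of this, in particular shrinking a counter-example down to a simpler one, but the shape is the same: generate inputs, check the property, report what broke.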

I hope this has sparked some excitement about new ways of testing your code. Next time someone asks “Unit tests or integration tests?” say “Yes, and also property-based tests”.