Monday 31 August 2020

Mutant algorithms

The word "algorithm" has caused a storm in recent UK news. Due to COVID-19, school children were not able to sit their exams. This left 16- and 18-year-olds waiting to see how they would be assessed, with obvious implications for their academic or career futures. As you may know, the grades were awarded based on an "algorithm", which our Prime Minister later described as mutant. According to BBC News, he said a "mutant algorithm" caused the exam chaos. This raises the question: what does our PM think a mutant algorithm is?

The news in the UK has talked generally about the algorithm's inputs being course work and teachers' estimated grades. These are "mutated" (or adjusted) by the algorithm to take into account a school's performance over the last three years. This means schools whose pupils sometimes struggle are more likely to be down-graded. The precise details are buried in a 319-page report. Feel free to read it all and report back. TL;DR: private and public schools tend to get higher grades than government-run schools, so poorer pupils tended to get down-graded and richer pupils did not. Some form of mutation, or even perversion, perhaps of justice, but not in the algorithm.

Now, some algorithms do use mutation. In fact genetic algorithms rely on mutation to seek out new solutions to problems. This is guided by a fitness function, to check the "mutant algorithm" is doing what's required. You can test such algorithms to see what they do, and keep an eye on them as they run to check they are heading the right way. You frequently spend a long time tuning parameters to get better results. This, on the face of it, has nothing whatsoever to do with the "mutant algorithm" our PM was talking about. 
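
To be concrete about what mutation means there, here's a toy sketch in Python: flip the occasional bit and let a fitness function decide whether the mutant is an improvement. The bit-string problem and the numbers are made up for illustration.

import random

def fitness(bits):
    return sum(bits)                    # how many bits are switched on

bits = [random.randint(0, 1) for _ in range(20)]
for generation in range(200):
    # mutation: each bit has a small chance of being flipped
    mutant = [1 - b if random.random() < 0.05 else b for b in bits]
    if fitness(mutant) >= fitness(bits):
        bits = mutant                   # keep the mutant only if it's no worse
print(fitness(bits), "bits on out of", len(bits))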

There has also been a hint of a slur on the programmers who wrote the algorithm, suggesting the idea was good and proper but the naughty programmers took it upon themselves to do something completely different that got out of hand, like a Marvel movie. Think Magneto (naughty programmers) versus Charles Francis Xavier (sensible people, like our PM? Go figure). I am sick of programmer bashing and the general misunderstanding of algorithms.

Where a genetic algorithm uses mutation, or a Monte Carlo simulation uses random numbers as input, it is still possible to test that the algorithm is doing what you require. Programmers should never abdicate responsibility for what they have built. However, it is highly irresponsible of the news to allow propaganda and misrepresentations to flourish like this.

A while ago, the Imperial College model for COVID-19 was open-sourced. At the time many people raised bug reports against it. One rumour suggested that running it twice with the same seed for the pseudo-random numbers would produce different results. Now, that might be described as a "mutant algorithm", but we'd usually describe this as buggy code. I don't believe our PM has the technical know-how to spot buggy code, but I'm willing to help him out if he wants. I'm also willing to be interviewed by the BBC to explain some of these technical issues in more detail, if they are interested. Or I could find other technical people who could equally well help out.
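
The kind of check that would catch that is tiny. Here's a sketch, with a hypothetical run_model standing in for the real simulation:

import random

def run_model(seed):
    rng = random.Random(seed)
    return [rng.random() for _ in range(5)]    # stand-in for the real simulation

# the same seed should give exactly the same results, every time
assert run_model(42) == run_model(42)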

DM me.

https://twitter.com/fbuontempo


 



Sunday 29 March 2020

Can a decision tree tell us about wine categories?

I previously wrote an overview showing how decision trees work: http://buontempoconsulting.blogspot.com/2019/07/decision-trees-for-feature-selection.html

This time, let's build a decision tree with some data. There are many freely available data sets used to explore machine learning, such as the Iris dataset in the UCI repository.

So let's try another one: the so-called wine dataset. This has three types of wine, with 13 attributes. Though many blogs list the attributes, I have been unable to find out what these three mystery types of wine are. They are three different Italian cultivars, but I have no idea which.

Rather than concentrating on building a decision tree to accurately categorise the wine, giving us a way to predict the type of another wine based on some or all of the 13 attributes, let's build a tree and see what it says.

These data sets are so common, they can be loaded directly from many machine learning packages, such as the Python module sklearn. This also has a DecisionTreeClassifier.

So,

from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

data = load_wine()
X = data.data
y = data.target
estimator = DecisionTreeClassifier(max_depth=2)
estimator.fit(X, y)
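
As a quick sanity check that the shallow tree has learnt something, you can score it on the data it was trained on. Not a proper evaluation, but enough to spot a wiring mistake:

print(estimator.score(X, y))   # should be well above the roughly 0.4 you'd get by always guessing the biggest class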

We asked for a maximum depth of 2, otherwise it makes a tree as deep (or high) as required to end up with leaves that are "pure" (or as pure as possible), meaning each leaf holds wines of a single category. Limiting the depth means it won't get as deep, or wide. But the first few layers will still show us which attributes are used to split up the data.

I say, "show", but we need to see the tree it's made. There are various ways to do this, but I'll use this:

from sklearn import tree
from IPython.display import SVG
from graphviz import Source
from IPython.display import display

graph = Source(tree.export_graphviz(estimator, out_file=None
   , feature_names=data.feature_names, class_names=['0', '1', '2']
   , filled = True))
display(SVG(graph.pipe(format='svg')))

Unfortunately, I've had to stick with class names, i.e. wine categories, of 0, 1 and 2, because I have no idea what they really are.

This generates the following picture:


The first line at each node tells you the attribute and the cut-off point chosen. For example, any wine with proline less than or equal to 755 goes down the left branch. The gini index is the measure used to decide which attribute, or feature, to split on. If you look up the decision tree classifier, you'll find other measures to try. The samples line tells you how many wines reached that node. We start with 178 wines, 71 of them in class 1 and fewer in each of the other classes, so the first node reports class 1.
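
As an aside, the gini value reported at the root follows directly from the class counts, which are 59, 71 and 48 if I've remembered the wine data correctly:

counts = [59, 71, 48]
total = sum(counts)
print(1 - sum((c / total) ** 2 for c in counts))   # roughly 0.66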

For proline less than or equal to 755, we have 111 samples, still mostly in class 1. For proline greater than 755 we have 67 samples, mostly in class 0. These 67 samples can then be split on flavanoids. Anything less than 2.165 is class 2, according to this tree. Anything greater is class 0. We do have some class 2 wine down the left-most branch as well, though. Curious, I had a brief wander round the internet to read about flavanoids in wine.
Wikipedia says

In white wines the number of flavonoids is reduced due to the lesser contact with the skins that they receive during winemaking.

Is class 2 white wine? Who knows. It could be. The decision tree made this stand out far more clearly than looking directly at the input data.
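
You can check the flavanoid hunch directly from the loaded data. Carrying on from the code above, and assuming the sklearn copy of the data calls the feature 'flavanoids':

idx = list(data.feature_names).index('flavanoids')
for wine_class in range(3):
    print(wine_class, X[y == wine_class, idx].mean())
# if the tree is right, class 2 should come out with the lowest average flavanoids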


I've put the code in a gist if you want to play around with it:
https://gist.github.com/doctorlove/bf6e42658d5806a61669a844b885983b

I think I've included everything here though.



I was planning on giving this as a lightning talk at the ACCU conference, but since it was cancelled this year, because of COVID-19, I wrote this short blog instead. If you can figure out what the types of wine are, get in touch.




Thursday 19 September 2019

Swarm algorithms

I wrote a book about genetic algorithms and machine learning. You can buy it here.




Apart from genetic algorithms and other aspects of machine learning, it includes some swarm algorithms. Where a genetic algorithm mixes up potential solutions by merging some together and periodically mutating some values, swarm algorithms can be regarded as individual agents collaborating, each representing a solution to a problem. They can work together in various ways, giving rise to a variety of swarm algorithms.

The so-called particle swarm algorithm can be used to find optimal solutions to problems. It's commonly referred to as particle swarm optimisation, or PSO for short. PSO is often claimed to be based on the flocking behaviour of birds. Indeed, if you get the parameters right, you might see something similar to a flock of birds. PSO is similar to colony algorithms, which are also nature-inspired and also have agents collaborating to solve a problem.

Suppose you have some particles in a paper bag, say somewhere near the bottom. If they move about at random, some might get out of the bag in the end. If they follow each other, they might escape, but more likely than not, they'll hang round together in a gang. Provide a fitness function to encourage them, though, and they can learn, for some definition of learn. Each particle can assess where it is, and remember the better places. The whole swarm will have a global best too. To escape a paper bag, we want the particles to go up. By inspecting the current (x, y) position, the fitness score can be the y-value. The bigger, the better. For real world problems, there can be many more than two dimensions, and the fitness function will require some thought.

The algorithm is as follows:

    Choose n
    Initialize n particles randomly
    For a while:
        Update best global position
        Move particles
        Update each particle's best position and velocity

The particles' personal bests and the overall global best give the whole swarm a memory, of sorts. Initially, this is the starting position for each particle. In addition to the current position, each particle has a velocity, initialised with random numbers. Since we're doing this in two dimensions, the velocity has an x component and a y component. To move a particle, update each coordinate of its position by adding the velocity, v, in that direction:

    x_{t+1} = x_t + v_{x,t}
    y_{t+1} = y_t + v_{y,t}

Since the velocity starts at random, the particles move in various directions to begin with. The trick comes in when we update the velocity. There are several ways to do this. The standard way adds, for each particle, a proportion of the current velocity, kinda remembering where it was heading, plus a fraction of the distance to its personal best and a fraction of the distance to the global best. This gives each particle momentum along a trajectory, making it veer towards somewhere between its best spot and the global best spot. You'll need to pick the fractions. Using w, for weight, since we're doing a weighted sum, and c1 and c2 for the other proportions, we have:

    v_{x,t+1} = w·v_{x,t} + c1(p_{x,t} - x_t) + c2(g_{x,t} - x_t)

If you draw the particles moving around you will see them swarm, in this case out of the paper bag. 
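
Here's a minimal sketch of that loop in Python. The bag dimensions, the parameter values and the random scaling of each pull (which full PSO implementations usually include) are my own choices, so treat it as a starting point for tuning rather than the definitive version:

import random

W, C1, C2 = 0.7, 1.5, 1.5                 # weight and proportions - tune these
BAG_TOP = 10.0                             # particles start near the bottom of the bag

class Particle:
    def __init__(self):
        self.x, self.y = random.uniform(4.0, 6.0), random.uniform(0.0, 1.0)
        self.vx, self.vy = random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)
        self.best = (self.x, self.y)       # best position seen by this particle

def fitness(position):
    return position[1]                     # bigger y is better

particles = [Particle() for _ in range(20)]
global_best = max((p.best for p in particles), key=fitness)

for step in range(100):
    for p in particles:
        # move by the current velocity
        p.x += p.vx
        p.y += p.vy
        # update the memories: personal best, then global best
        if fitness((p.x, p.y)) > fitness(p.best):
            p.best = (p.x, p.y)
        if fitness(p.best) > fitness(global_best):
            global_best = p.best
        # steer towards the personal and global bests, keeping some momentum
        p.vx = W * p.vx + C1 * random.random() * (p.best[0] - p.x) + C2 * random.random() * (global_best[0] - p.x)
        p.vy = W * p.vy + C1 * random.random() * (p.best[1] - p.y) + C2 * random.random() * (global_best[1] - p.y)

escaped = sum(p.y > BAG_TOP for p in particles)
print(f"best height {global_best[1]:.2f}, {escaped} of {len(particles)} particles out of the bag")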

This is one of many ways to code your way out of a paper bag covered in my book. When a particle solves the problem, here by getting out of the bag, inspecting its x and y values gives you the solution. PSO can be used for a variety of numerical problems. It's usually described as a stochastic optimisation algorithm. That means it does something random (stochastic) to find the best (optimal) solution to a problem. You can read more here.






Tuesday 16 July 2019

Decision trees for feature selection

I asked twitter who is using decision trees and what for. Most were using them, unsurprisingly, to make decisions. It wasn't always clear how the trees themselves were built.

If you are armed with data, so that each row has some features and a category - either yes/no, or one of many classes - you can build a classifier from the data. There are various ways to decide how to split up the data. Nonetheless, each algorithm follows the same overall process. Start with a root node holding all the data, and add nodes holding part of the data.

Then

  1. Pick a feature
  2. Split the data set, some to the left branch, some to the other branch (or branches) depending on the value of the feature
  3. If all the data at a node is in the same category (or almost all in the same category), form a leaf node
  4. Continue until each node is a leaf node.



This is a bit like a sorting algorithm: in quick sort, you choose a pivot value and split the data down one branch or the other, until you have single points at nodes. Here we don't choose a pivot value but a feature. The choice of feature can be based on statistics, information theory or even made at random. At each step, you want to know whether all the items in one category tend to have the same value, or range of values, for a feature. Once you are done you have a tree (or flow chart) you can apply to new data. Each way to split has various pros and cons. You can even build several trees. A random forest builds lots of trees and they vote on the class of new, unseen data. You could build your own voting system, using a variety of tree induction techniques. This might avoid some specific problems, like the over-fitting some techniques are prone to.
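
To make the process above concrete, here's a rough sketch of the induction step using the gini measure. It assumes numeric features and small integer class labels, and it is deliberately simplistic rather than something you'd use in production:

import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # try every value of every feature as a cut-off, keep the purest weighted split
    best = None
    for feature in range(X.shape[1]):
        for value in np.unique(X[:, feature]):
            left = X[:, feature] <= value
            if left.all() or (~left).all():
                continue
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if best is None or score < best[0]:
                best = (score, feature, value)
    return best

def build(X, y, depth=0, max_depth=2):
    split = None if depth == max_depth or len(np.unique(y)) == 1 else best_split(X, y)
    if split is None:
        return {"leaf": int(np.bincount(y).argmax())}      # majority class
    _, feature, value = split
    mask = X[:, feature] <= value
    return {"feature": feature, "value": value,
            "left": build(X[mask], y[mask], depth + 1, max_depth),
            "right": build(X[~mask], y[~mask], depth + 1, max_depth)}

print(build(np.array([[1.0], [2.0], [3.0], [4.0]]), np.array([0, 0, 1, 1])))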

You can use decision tree induction in a variety of places, even if you don't want a full decision tree or rule set. A rule set is a tree written out as a sequence of if statements. For example,

If currency is USD then data goes missing.

If you are moving a data source from one provider to another, and some data goes missing, can you spot what the missing items have in common? You could do a bit of manual investigation, say using pivot tables and data filters in Excel. However, a decision tree might find common features far more quickly than you can. This is a form of feature selection - using an algorithm to find pertinent features. Find a decision tree implementation in a language you know, or write one yourself, and have an experiment.
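
Here's what that might look like with sklearn. The data and the column names are invented for the sake of the example, with the missing rows rigged to line up with the currency, but the shape of the investigation carries over:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 500
is_usd = rng.integers(0, 2, n)              # 1 if the row's currency is USD
amount = rng.uniform(0, 1000, n)            # an unrelated feature
went_missing = is_usd                       # in this toy data, USD rows always go missing

X = np.column_stack([is_usd, amount])
tree = DecisionTreeClassifier(max_depth=2).fit(X, went_missing)
print(export_text(tree, feature_names=['is_usd', 'amount']))
# the first split should be on is_usd, pointing straight at the common factor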

My book, Genetic Algorithms and Machine Learning for Programmers, has a chapter explaining how to build one type of decision tree. Take a look. Lots of machine learning frameworks also have tools to help you build decision trees. Next time you want to know what certain things have in common, try a decision tree and see what you learn. Machine learning is often about humans learning, rather than the machines.








Friday 3 May 2019

Log driven development

Everyone knows attempting to figure out what's happening by resorting to print statements is desperation. Everyone knows you should use TDD, BDD or some other xDD to write code well, don't they? Well, I am desperate. I suggest PDD, print driven development: slow and error prone, but sometimes helpful.

I recently shoe-horned a new feature into a large existing code base. It's hard to make it run just a small piece of work, and I wanted to check that a config change created my new class via the dependency injection framework. Now, I could put in a breakpoint and wait a while for it to get hit, or not. Fortunately, the code writes logs. By adding a log line to the constructor, I could tell much more quickly that I'd got the config correct. Write the code, leave it running, have a look later. I am tempted to name this Log Driven Development.
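
The details will vary with the codebase, but the idea fits in a couple of lines. A sketch in Python, with made-up class and logger names:

import logging

logger = logging.getLogger("my_new_feature")

class MyNewFeature:
    def __init__(self, config):
        # one log line is enough to confirm the wiring actually created the class
        logger.info("MyNewFeature constructed with config %s", config)
        self.config = config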

On the one hand, adding the moral equivalent of print statements is a bad thing. Teedy Deigh mused on logging in Overload 126, saying

It is generally arbitrary and unsustainable and few people are ever clear what the exact requirements are, so it spreads like a cat meme.

And yet, if I add informative logging as things happen, I can see how the whole system hangs together. Jonathan Boccara mentioned finding logs when trying to understand an unfamiliar code base in his talk at the ACCU Conference. Existing logs can help you understand some code. Looking at this from a different direction raises the question: are my logs useful? As a first step, seeing a log entry when I wired in my class felt like a win. It told me what I wanted to know. I must make sure the logs I write out as my class starts doing something useful are themselves useful.

This is a step in the direction of a walking skeleton style approach - get things running end to end, outside-in, then fill in the details. I'm much more comfortable with a TDD approach, but keeping an eye on the whole system and how it hangs together is important. Building up the new feature by writing useful logs means the logs will be useful when it's released. LDD - log driven development. Why not?















Wednesday 17 April 2019

ACCUConf 2019

The ACCU conference happened in Bristol again this year. For the first time ever, I was at a workshop. In fact, I ran a workshop, with Chris Simons. We talked about Evolutionary Algorithms in practice. We gave a 90 minute talk later in the week, using the same Java framework (JCLEC), with slides here. High level summary: can your computer solve problems by making a few random guesses and iteratively improving, using crossover (merging together previous attempts) and mutation (nudging things up or down, or flipping bits)? The answer is yes: we managed to turn on all the bits in an array, code our way out of a paper bag (by firing virtual cannon balls based on an old Overload article), and finally, in the 90 minute workshop, we generated code for Fizz Buzz.

The full schedule is here. Having five tracks meant I missed lots, but talks will appear on the YouTube channel over time. I'll just give brief notes on a handful of talks I attended. The opening keynote was "Delivering software that is secure and usable - who’s job is it?" by M Angela Sasse. Angela called out StackOverflow as being functionally great but giving bad security advice, in contrast to the official manuals, where the security advice is great but the functional help is worse. This was based on measuring several developers attempting to use a software product. How can you actually measure security or usability? How are you currently measuring it? Mention was made of hard-to-follow security rules, which people tend to work around. Angela called for a way to reprogram the security experts. How good are they at conflict resolution? Do they have social marketing skills? Twitter devolved into quips about social engineering at that point. My final note says,
Programmers are tribal and seek approval. Try to trust and collaborate instead.

Next I'll mention "10 Techniques to Understand Code You Don’t Know" by Jonathan Boccara. He's written a book, which I've seen several people recommend. The 10 techniques fell into three groups: explore, speed read, and detail.
Exploring covered

  • using and finding the I/O frameworks, 
  • performing local analysis - getting the hang of one or two important functions, 
  • analysing call stacks to join the dots between modules. 

Speed reading covered

  • reading the end first - don't be put off by a long function, find the output or returns and worry about the rest later,
  • find frequent words, both count and span (total and lines with words)
  • filter on flow control - giving something like a table of contents for a book
  • scan for the main action - feel free to ignore catch blocks or elses, focus on the happy path

Finally, you sometimes need to start going into detail

  • try scratch refactoring
  • practice writing functions in the code
  • team up - strive for pair understanding

There was a discussion about flame graphs at the end, and he mentioned "How to read a book: the classic guide to intelligent reading" by Adler and Van Doren. This points out that you don't need to read a non-fiction book in order. Jump around, follow back links, jump straight in to what you want to learn. Very non-linear.

Next, I'll talk about "The anatomy of an exploit" by Patricia Aas. She started by mentioning the weird machine. You can see most programs as a finite state machine. An exploit jumps out of the finite states into other, unintended states. She looked at CWE-242, a list of potentially dangerous functions. CWE is the Common Weakness Enumeration, available online, listing things to avoid. Her talk pulled on things that might go wrong with gets or std::cin. Surprisingly, you get more warnings from C than C++. By disabling warnings, one at a time, we ended up with code to get a prompt. Once you have a shell on another machine, you can then do a variety of nefarious things. She covered loads of things including ASLR (address space layout randomisation), heap grooming and use after free. Security was a definite theme at this conference, and many developers understand far too little about it.

Herb Sutter gave Thursday's keynote on "De-fragmenting C++: Making exceptions more affordable and usable". He called out a divide between teams who can and who cannot use exceptions. Many libraries have a mix of exceptions and return codes. He said "Pick a lane". C++ is supposed to be zero overhead:

  • a feature only costs if it's used
  • it's better than coding it yourself

This is not true for exceptions. He considered the difference between program-recoverable and non-recoverable errors. What can you do about stack exhaustion, for example? Who do you report problems to? Humans or code? Exceptions are automatic (a good thing) and invisible (a bad thing). He sketched out ways we could make exceptions have zero overhead. Watch this space.

Anthony Williams then talked about async, executors and callbacks: "Here’s my number; call me, maybe. Callbacks in a multithreaded world". He called out a few things to be aware of. Does the order of your callbacks matter? Can you deregister them? He encouraged us to capture by value, rather than reference, unless we have a really good excuse.

At lunchtime, there was a book signing. I sold several copies of my book, "Genetic Algorithms and Machine Learning for Programmers". Three other authors, Anthony Williams, Ivan Cukic and Jonathan Boccara, were also selling books, but I didn't get a chance to go and talk to them. Thanks to ACCU for the chance to do this. I put mine in paper bags, and even wrote a receipt on one. The chapters in mine show how to code your way out of a paper bag, so it seemed sensible.



I gave a session with Chris Simons about how to teach your computer to code Fizz Buzz. We plan to write this up for ACCU's Overload magazine shortly.

On Friday, Paul Grenyer gave the keynote. He reminisced about people he'd met when he was an ACCU member, and all the things he'd done, some that worked and some that didn't, in Norwich, to grow the tech network. There's now a background discussion on accu-general and accu-members email lists about how to revive some things we used to do, and find new things to do, that will be valuable to the group. I'd love to see the mentored developers reboot.

Next, I went to "Interactive C++ : Meet Jupyter / Cling - The data scientist’s geeky younger sibling" by Neil Horlock. He talked about Code Club, and teaching people. This led nicely into using Cling/Jupyter to have notebooks for C++. Cling is an interactive (JITted) version of clang. I can't do his talk justice here. It was amazing. It managed to cope with templates, and a variety of things that blew my mind. He demonstrated using RISE to make RevealJS slides from a notebook, so I think I was watching a talk in a talk in a talk.

My notes have run out at this point. At the speakers' dinner we met EchoBorg. An actor (an echoborg) voiced the words of a chatbot. People volunteered to be interviewed to become an echoborg themselves. This set off cyberpunk-style sci-fi in my head. Again, I won't do it justice, but watching the conversation develop was incredibly interesting. Have a look at their websites:



I went to two talks on Saturday: "Windows Native API" by Roger Orr and
"Best practices when accessing Big Data or any other data!" by  Rosemary Francis. I was too tired to make notes by that point, and we left early, since we had a three hour drive home. Rog considered several ways to return 42 from a program and showed several steps that happen before main; something people don't always consider. He touched on security too. I didn't stay to the end of Rosemary's, but she was talking about tooling her company has developed to watch programs using big data and tracing bottlenecks. In my opinion, many data scientists make mistakes some programmers might avoid. Her first example was opening and closing a file in a loop. I wish machine learners and programmers could talk to each other more and help each other out.

I had a great conference, and the includecpp crew were there. I dipped in and out of the discord chat. It's lovely to see people supporting each other. Simple things, like chats about where to go for dinner.

Echoborg has left dystopian Sci-Fi short stories brewing in my head, and the Jupyter/cling talk left me with lots to explore. Thanks ACCUConf. Hope to be there next year.



Tuesday 12 March 2019

The ACCU's Overload magazine

ACCU is an organisation for programmers. Its original focus was C and C++, but now members use a variety of languages, talk about testing and process and how to keep learning. ACCU holds an annual conference in the UK, attended by people from around the world. There's even a YouTube channel of recorded talks from this.

As a member you get a discount for the conference, can volunteer to do book reviews, can participate in study groups, though these have been quiet lately, and get two magazines: the CVu members' magazine and Overload, which is open to anyone. There are also several local groups if you want to come along and meet us.

I've been a member for several years now. It's been a great networking opportunity and I have learnt so much from other members. I love the magazines, and by starting to write for them myself, I stepped up my game. I began by writing book reviews, then tried some of the Student Code Critiques in CVu. Eventually, I wrote an Overload article, pulling together a discussion on the accu-general mailing list about floating point numbers.

I took on the role of Overload editor in 2012. We welcome articles from non-members as well as members. They are peer reviewed, meaning the author gets feedback, questions and suggestions. A surprisingly high number of writers have gone on to write books, myself included. (I mentioned I wrote a book about genetic algorithms and machine learning, yes?)

If you have an article you'd like to get published, let me know. We do accept existing blog posts, but the review team might well ask for slight improvements. There are some submission instructions here.

We welcome established writers as well as new writers. If you've never written an article, give it a go. You can learn a lot by trying to write something up. For example, as you try to explain something you may find gaps in your knowledge and understanding. Questions and suggestions from the review team will make your article better.

I love the ACCU and am looking forward to this year's conference, in just under a month. If you can't make the conference, find a local group, or consider joining the organisation. Or submit an article for Overload. Get involved.


ACCU Home page