Print a (Cycle-free) Digraph Statistic

This code is a highly abbreviated version of code I developed in connection with my genealogical database. It is an easy, almost trivial, programming problem, but I include it because of two instructive points.

If you've gotten this far, you're a confident or aspiring programmer, and all good programmers should have a basic knowledge of graphs and their nomenclature. If not, I'll summarize briefly: Given a set of tokens (or nodes) {X, Y, ...}, a digraph is a set of ordered pairs (X,Y),.... If (X,Y) is in the digraph, we'll say ``X is a parent of Y'' or equivalently ``Y is a child of X.'' If, for example, (X,Y1), (Y1,Y2), (Y2,Y3), (Y3,Z) are all in the digraph, they constitute a ``path of length 4'' from X to Z, and Z is a ``descendant'' of X. If no node is a descendant of itself, then the digraph is cycle-free.

The trees of games in the Nim and Tic-tac-toe families are cycle-free digraphs. The trees of chess and checkers, however, have cycles. The links in a genealogical database form a cycle-free digraph since, at least until man invents a time machine, two people cannot be each others' ancestors.

After I built my genealogical database I wanted to know the answers to questions like

How many times does Charlemagne appear in my pedigree?
What is the shortest path from Charlemagne to me?
What is the longest path from Charlemagne to me?
What path from Charlemagne to me uses the most females?

The number of distinct paths can get rather large. For example there are 100 distinct paths from Charlemagne to King John Lackland and 110 distinct paths from Lackland to me, so that makes 11,000 distinct paths from Charlemagne to me not even counting those that bypass King John.

Since this is just a programming example, instead of tracing paths in a real genealogy, let's just look at the ways to get from X to Y when the only legal moves are to subtract 4, 5, or 12.

As I said, there are two instructive things to note about the software. The first is that the essential flow is the same (similarity makes things simple) for all aggregation operations (counting, finding the minimum, the maximum) even though those operations seem quite different. (Replace OP_COUNT with OP_MINLENG or OP_MAXLENG in the #define statement to do a different aggregating operation.)

If you wrote the three routines independently, without realizing their close similarity, it wouldn't much matter: these short routines can be written and modified quickly. But the concept of recognizing and exploiting similarities in data structure or processing method becomes increasingly important as problems become harder.

Some other aggregation-like operators may be handled with the same template (as I have illustrated with a OP_MOSTFIVE option). There are harder aggregate operators that can not quite be handled this way, for example finding the sum of all path lengths, or the average path length.

When you have mastered the original problem, modify it to find the sum and average of path lengths. The first ``suggested answer'' turns out to be a bad model here!

I will show two solutions to the exercise. The first is unable to solve the extended problems (sum of and average path-length) -- one would need to rip it apart, (performing code replications) and extend the data structure slightly to support these slightly harder ops. For this and other reasons the first solution is a bad solution, but I include it because it demonstrates the simplicity most clearly.

It may even be easier to code the ``bad but simple solution,'' debug, study and understand it, and then throw it away and write the improved version, than to write the improved version by itself. (In Lesson 1 we encouraged you to learn to type quickly and accurately; the convenience of practice throwaway code is one of the reasons.)

/* * Code to print a statistic about paths from X to Z * in a (cycle-free) digraph. The same code is used * for finding the length of the minimum path, length * of maximum path, and the total number of paths. * To make this a simple self-contained demo, we let the nodes * be non-negative integers, and just provide the arbitrary links * (x, x+4), (x, x+5), and (x, x+12). In other words we * generate a trivial (3-parent) pedigree, to make the code * and its demonstration self-contained. */ /* * The following `-2' is just an arbitrary `magic number' token * that a more structured code might reflect with an enum or * independent boolean variable. One purpose of this example * is to show that, although a tiny bit of magic is required, * (e.g. `-2' following) my simple approach is more concise, * and more easily written or studied than would usually arise * in a high-level language like C++. */ #define UNINITIALIZED -2 #define OP_COUNT /* replace this for Minleng or Maxleng */ #ifdef OP_COUNT #define PRIMRESULT 1 #define INITRESULT 0 #define STEP(res) (res) #define UPDATE(old,new) ((old) + (new)) #define RESNAME "number of distinct paths" #endif #ifdef OP_MINLENG #define PRIMRESULT 0 #define INITRESULT 99999999L #define STEP(res) ((res) + ((res) != INITRESULT)) #define UPDATE(old,new) ((new) < (old) ? (new) : (old)) #define RESNAME "length of the shortest path" #endif #ifdef OP_MAXLENG #define PRIMRESULT 0 #define INITRESULT (-1) #define STEP(res) ((res) + ((res) != INITRESULT)) #define UPDATE(old,new) ((new) > (old) ? (new) : (old)) #define RESNAME "length of the longest path" #endif #ifdef OP_MOSTFIVE #define PRIMRESULT 0 #define INITRESULT (-1) #define STEP(res) ((res) + ((res) != INITRESULT && chix == 1)) #define UPDATE(old,new) ((new) > (old) ? (new) : (old)) #define RESNAME "maximal number of 5-steps among all paths" #endif #define NUMNODES 250 typedef long long Bigint; struct node { Bigint memo; struct node *child[3]; } Node[NUMNODES]; Bigint aggreg_op(xp, zp) struct node *xp, *zp; { int chix; struct node *yp; Bigint subres, result = INITRESULT; if (xp == zp) result = PRIMRESULT; else for (chix = 0; chix < 3; chix++) { if (yp = xp->child[chix]) { if (yp->memo == UNINITIALIZED) yp->memo = aggreg_op(yp, zp); subres = STEP(yp->memo); result = UPDATE(result, subres); } } return result; } /* * Similarly, it is silly to have so much different functionality in * the `main()' routine as I show here. I tend to do it though, at * least in simple code examples, to keep things ... simple. */ main(argc, argv) char **argv; { int x, z; Bigint result; struct node *p; for (x = 0; x < NUMNODES; x++) { p = Node + x; p->memo = UNINITIALIZED; if (x >= 4) p->child[0] = p - 4; if (x >= 5) p->child[1] = p - 5; if (x >= 12) p->child[2] = p - 12; } if (argc != 3) { printf("Usage: a.out X Z\n\tnumbers in range 0 ... %d\n", NUMNODES - 1); exit(1); } x = atoi(argv[1]); z = atoi(argv[2]); if (0 <= x && x < NUMNODES && 0 <= z && z < NUMNODES) { result = aggreg_op(Node + x, Node + z); if (result == INITRESULT) printf("There are no paths from %d to %d\n", x, z); else printf("The %s from %d to %d is %Ld\n", RESNAME, x, z, result); exit(0); } }

For the record, Charlemagne appears in my fully expanded pedigree 523,116 times (that's just using the known links which appear in my database); in the shortest path he's my 31-great grandfather; in the longest path he's my 53-great grandfather.

Adam, the First Man, appears 968,292,531,344 times in my fully expanded pedigree; that's almost a trillion times! You can see why I used the special "Bigint" facility of the GNU compiler.

(The preceding was written when my database was only one-third the size it is now. The database now has over 5 million paths from Charlemagne to me, and the number of paths from Adam exceeds the capacity of Bigint.)

In the toy problem defined by the source code, the largest number that will be printed (by a.out 249 0) is 8229672585400633201; that's over 8 quintillion. It doesn't take long, though, to count up to that preposterous number; in fact the built-in timing facility prints "0.000 secs" --- less than one millisecond to get all the way to 8 quintillion.

This may seem like no surprise to those familiar with the power of multiplication; I mentioned earlier that the 100 paths from Charlemagne to Lackland and 110 paths from Lackland to me generate 11,000 paths without further ado. But here comes the cute and/or odd surprise I promised earlier, or at least it seems cute to me:

In the path-counting source code there are no multiplications whatsoever. In fact the only operations involved in generating the total count are (a) assigning 0 or 1 as the answer in the trivial cases, (b) remembering an earlier answer, (c) adding two answers together. Multiplication is fun, but those darned additions can add up pretty fast too!

Perhaps I'm mentally deficient to think this is odd or interesting, but if you approached this problem inferring, from the Charlemagne --> Lackland --> James example, that multiplications were essential, you'd get lost quickly. On the other hand, if you just did the easy parts of the coding first, top-down and/or bottom-up, you'd soon gasp in surprise: ``Hey, I'm done! How'd that multiplication disappear?''

As promised, here's a better, more practical, version of these functions.



/*
 * This is a slightly different (and more practical) version
 *      of the aggregating-function code example, with
 *      most of the same comments applying.
 */

#define         UNINITIALIZED   -2
typedef long long Bigint;
Bigint  Initres;

/*
 * I try to keep my ``arrangement of white-space''
 * consistent ... but *don't become dogmatic!*
 *
 * The following are closely related functions, all one-liners,
 * so use two lines per function instead of five lines, if you
 * think it will help the reader grasp the overall idea more quickly.
 */

Bigint  af_identity(res, spare)  Bigint res;
{       return res; }

Bigint  af_grow(res, spare)  Bigint res;
{       return res + (res != Initres); }

Bigint  af_sum(old, new)  Bigint old, new;
{       return old + new; }

Bigint  af_min(old, new)  Bigint old, new;
{       return old < new ? old : new; }

Bigint  af_max(old, new)  Bigint old, new;
{       return old < new ? new : old; }

struct aggop {
        char    *ag_resultname;
        long    ag_primres, ag_initres;
        Bigint  (*ag_step)(), (*ag_upda)();
} Aggop[] = {

#define AGG_CNT 0
        "number of distinct paths",
        1, 0, af_identity, af_sum,

#define AGG_MIN 1
        "length of the shortest path",
        0, 99999999L, af_grow, af_min,

#define AGG_MAX 2
        "length of the longest path",
        0, -1, af_grow, af_max,

#define AGG_TOT 3
        "total of path lengths",
        0, 0, af_sum, af_sum,

#define AGG_NUMOPS      4
};

#define NUMNODES        250
struct node {
        Bigint  memo[AGG_NUMOPS];;
        int     numc;
        struct node     *child[3];
} Node[NUMNODES];

/* NB: Must initialize for AGG_CNT before running AGG_TOT */
Bigint aggreg_op(xp, zp, agix)
        struct node *xp, *zp;
{
        int             chix;
        struct node     *yp;
        Bigint          subres, result;
        struct aggop *agp = Aggop + agix;

        if (xp == zp) {
                result = agp->ag_primres;
        } else {
                Initres = result = agp->ag_initres;
                for (chix = 0; chix < xp->numc; chix++)
                if (yp = xp->child[chix]) {
                        subres = yp->memo[agix];
                        if (subres == UNINITIALIZED)
                                subres = yp->memo[agix]
                                        = aggreg_op(yp, zp, agix);
                        subres = (agp->ag_step)(subres, yp->memo[AGG_CNT]);
                        result = (agp->ag_upda)(result, subres);
                }
        }
        return result;
}

/*
 * In this case, it would be easy to get rid of this function
 *  if that seems SIMPLER, but it is usually desirable to have
 *  ``cover'' entry-points like this.
 * In fact, eliminating ``covers'' like this for a *false simplicity*
 *  is a vice not a virtue (though a vice I personally tend to suffer).
 */
Bigint do_aggop(x, xdest, agix)
{

        return aggreg_op(Node + x, Node + xdest, agix);
}

main(argc, argv)
        char    **argv;
{
        int     x, agix, xdest;
        Bigint  result;
        struct node     *p;
        struct aggop *agp;
        Bigint  numpaths, totleng;

        for (x = 0; x < NUMNODES; x++) {
                p = Node + x;
                if (x >=  4) p->child[p->numc++] = p -  4;
                if (x >=  5) p->child[p->numc++] = p -  5;
                if (x >= 12) p->child[p->numc++] = p - 12;
                for (agix = 0; agix < AGG_NUMOPS; agix++)
                        Node[x].memo[agix] = UNINITIALIZED;
        }

        if (argc != 3) {
                x = xdest = -1;
        } else {
                x = atoi(argv[1]);
                xdest = atoi(argv[2]);
        }
        if (0 > x || x >= NUMNODES || 0 > xdest || xdest >= NUMNODES) {
                printf("Usage: a.out X Z\n\tnumbers in range 0 ... %d\n",
                        NUMNODES - 1);
                exit(1);
        }

        agp = Aggop + AGG_CNT;
        numpaths = do_aggop(x, xdest, AGG_CNT);
        if (numpaths == agp->ag_initres) {
                printf("There are no paths from %d to %d\n",
                        x, xdest);
                return;
        }
        printf("The %s from %d to %d is %Ld\n",
                        agp->ag_resultname, x, xdest,
                        numpaths);
        agp = Aggop + AGG_MIN;
        printf("The %s from %d to %d is %Ld\n",
                        agp->ag_resultname, x, xdest,
                        do_aggop(x, xdest, AGG_MIN));
        agp = Aggop + AGG_MAX;
        printf("The %s from %d to %d is %Ld\n",
                        agp->ag_resultname, x, xdest,
                        do_aggop(x, xdest, AGG_MAX));
        agp = Aggop + AGG_TOT;
        totleng = do_aggop(x, xdest, AGG_TOT);
        printf("The %s from %d to %d is %Ld\n",
                        agp->ag_resultname, x, xdest, totleng);
        printf("The average path-length from %d to %d is %f\n",
                        x, xdest, totleng / (double)numpaths);
        exit(0);
}

Please send me some e-mail.
Go back to my home page.