Extracting A Query String Parameter With Mod Rewrite


I was recently working on a website that was being modernized and we had to redirect the old site to the new site.  This post shows the solution along with an explanation, and is targeted at readers already familiar with mod_rewrite.

Extracting a single parameter

<IfModule mod_rewrite.c>
  RewriteEngine On

  RewriteBase /

  RewriteCond %{REQUEST_URI} ^/oldpage\/oldfile\.php [NC]
  RewriteCond %{QUERY_STRING} ^.*[&]?id=(\d+).*$ [NC]
  # the target below is a placeholder; substitute the new site's URL
  RewriteRule . /newfile.php?id=%1? [R=301,L]
</IfModule>


This will match a URL such as /oldpage/oldfile.php?id=123 and redirect it to the new location, with the captured id value available as %1.

A breakdown of the query string:

  • ^ asserts position at the beginning of the string
  • .* reads any and all characters up to the next match
  • [&]? matches a literal & zero or one times (allows the parameter to be listed after other parameters)
  • id=(\d+) matches “id=” literally, then captures one or more decimal digits into a back reference.  \w instead of \d will capture all standard letters and digits
  • .* matches the rest of the string
  • $ asserts the end of the string

The question mark at the end of the rewrite rule specifies that the old query string will not be appended to the new URL.  The [NC] at the end of each RewriteCond line makes the match case insensitive (id=1, Id=1).  [R=301,L] means the redirect is a 301 (permanent) and that rewrite processing stops there (L is for last).

Extracting Multiple Parameters

Multiple parameters can only be extracted this way if you know the order in which they are listed in the URI.  The following example extracts ‘id’ as a number and ‘category’ as a word; the parameters must appear in that order.  Notice the order of the capture groups: ‘id’ is captured into %1 and ‘category’ into %2.

  RewriteCond %{REQUEST_URI} ^/oldpage\/oldfile\.php [NC]
  RewriteCond %{QUERY_STRING} ^.*[&]?id=(\d+).*&category=(\w+).*$ [NC]
  # the target below is a placeholder; substitute the new site's URL
  RewriteRule . /newfile.php?id=%1&category=%2? [R=301,L]

Shortest Path First Routing: Dijkstra's Algorithm in C


I took over a project consisting of PIC chips interfacing with basic 802.15.4 radio modules.  These modules had no routing or mesh networking abilities.  The only feature was the ability to send and receive to other modules that were within range.  When I first received the project the routing method was that every packet was broadcast to every other node, which in turn broadcast it to every other node, and so on.  There was a max hop count in the packet to limit resends.  This worked fine with a few modules but as the network grew the packet traffic went up almost exponentially.  After researching routing methods I decided on link-state SPF routing based on Dijkstra’s algorithm.

Dijkstra’s algorithm requires every node to maintain an entire ‘map’ of the network that consists of a list of every node and who those nodes can see.  This was implemented by having each node ping out, maintain a list of nodes it could see, and distribute this information every couple of minutes automatically or whenever a change was detected.  This method combined with Dijkstra’s algorithm created a self-healing mesh network with efficient shortest-path routing.

The details of Dijkstra’s algorithm are beyond the scope of this article.  The code is based on the pseudocode described on the Wikipedia page.


The one downside to link-state routing vs distance-vector routing is that RAM usage grows quadratically with the size of the network.  I only had a couple KB of RAM to work with and the required network size just happened to fit.

In my implementation I was able to cram an entire link-state record into 2 bytes.  There were 6 bits for the module address, 2 bits for the role (multiple device roles), 6 bits for the age, and 2 bits for the distance metric.  Distance was mapped down to 2 bits as values 0-3 (treated as 1-4) based on the radio connection quality metric 0-255.
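As a sketch of that packing: the exact field order and the quality-to-distance mapping below are illustrative assumptions, not the exact production layout, but they show one way the four fields fit into 2 bytes.

```c
#include <stdint.h>

/* One possible 2-byte packing: 6-bit address, 2-bit role,
   6-bit age, 2-bit distance.  Field order here is illustrative. */
typedef uint16_t PackedLinkstate;

PackedLinkstate PackLinkstate(uint8_t addr, uint8_t role,
                              uint8_t age, uint8_t quality255)
{
    uint8_t dist = quality255 >> 6; /* map quality 0-255 down to 0-3 */
    return (uint16_t)((addr & 0x3F)
                    | ((uint16_t)(role & 0x03) << 6)
                    | ((uint16_t)(age  & 0x3F) << 8)
                    | ((uint16_t)(dist & 0x03) << 14));
}

uint8_t UnpackAddr(PackedLinkstate ls) { return ls & 0x3F; }
uint8_t UnpackRole(PackedLinkstate ls) { return (ls >> 6) & 0x03; }
uint8_t UnpackAge(PackedLinkstate ls)  { return (ls >> 8) & 0x3F; }
uint8_t UnpackDist(PackedLinkstate ls) { return (ls >> 14) & 0x03; }
```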

With a 2 byte link-state the RAM required is:

max-devices == n:   ((n * 2) + 5) * n bytes
max-devices == 10:  ((10 * 2) + 5) * 10 = 250 bytes
max-devices == 20:  ((20 * 2) + 5) * 20 = 900 bytes
max-devices == 40:  ((40 * 2) + 5) * 40 = 3400 bytes
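The figures above follow directly from the formula; as a quick sanity check it can be computed as:

```c
/* RAM required for the routing table: each of n entries holds
   n 2-byte link-states plus about 5 bytes of bookkeeping. */
unsigned int RoutingTableBytes(unsigned int n)
{
    return ((n * 2) + 5) * n;
}
```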

In the example code below I have given each value its own variable for ease of reading the code which greatly expands the RAM usage.  This implementation only concerns itself with keeping an up to date list of the first hop address for every node on the network.

Link-state and Routing table entries:


typedef struct _LINKSTATE_ENTRY {
  int8 destAddr;
  int8 distanceMetric;
  int8 role; //for routing, not Dijkstra
  int8 age; //for routing, not Dijkstra
} LinkstateEntry;

typedef struct _ROUTING_ENTRY {
  unsigned int8 devAddr; // address of the device
  unsigned int8 firstHop; // address of first hop
  unsigned int8 ageSeconds; //seconds since update
  unsigned int8 sequence; //last sequence number
  unsigned int8 visited; //for Dijkstra's algorithm
  unsigned int8 pathLength; //for Dijkstra's algorithm
  unsigned int8 prevNode; //Dijkstra
  unsigned int8 neighborCount; //number of linkstates for this node entry

  //list of reachable nodes
  LinkstateEntry linkStates[MAX_DEVICES_ON_NETWORK-1];
} RoutingEntry;

RoutingEntry routeIndex[MAX_DEVICES_ON_NETWORK-1];//local routing table
LinkstateEntry personalLinkstate[MAX_DEVICES_ON_NETWORK-1];//local node linkstate that gets distributed to everyone
unsigned int8 neighborCount = 0; // number of neighbors in local link state

Any time the link-state table for a node is updated, Dijkstra’s algorithm is run to determine the first hop for all nodes.  This implementation is not concerned with the full path, just the first address on that path.  This is a bit complicated.  The first function below sets everything up: it calculates the path lengths to each node, then for each node on the network calculates the first hop address.  Whenever a device receives a packet with a delivery address other than itself, it merely references the routing table for the destination's first hop address and passes the packet on.  The routing table entry index (routeIndex[x]) directly corresponds to a node radio address.

void Dijkstra_Recalculate_Routes()
{
  int8 x;

  Dijkstra_Recalculate(); //calculate all path lengths

  for( x = 0; x < MAX_DEVICES_ON_NETWORK-1; x++ )
  {
    if( routeIndex[x].devAddr ) //if device is in table
      FindFirstHop( routeIndex[x].devAddr ); //find first hop for routing
  }
}

void Dijkstra_Recalculate()
{
  unsigned int8 x;
  unsigned int8 addr;
  unsigned int8 tempAddr;
  unsigned int8 alt;
  unsigned int8 neighborAddr;

  //reset all entries: unvisited, path length 'infinite'
  for( x = 0; x < MAX_DEVICES_ON_NETWORK-1; x++ )
  {
    if( routeIndex[x].devAddr != 0 ) // if device is in routing table
    {
      routeIndex[x].visited = 0; //mark as unvisited by the algorithm
      routeIndex[x].pathLength = 255; //mark path as infinite
      routeIndex[x].prevNode = 0;
    }
  }

  //seed the path lengths with our direct neighbors
  for( x = 0; x < neighborCount; x++ )
  {
    addr = personalLinkstate[x].destAddr;

    if( !routeIndex[addr-1].devAddr ) //if we don't have a routing entry yet
      continue;

    routeIndex[addr-1].pathLength = personalLinkstate[x].distanceMetric;
    routeIndex[addr-1].prevNode = GetMyAddress();
  }

  tempAddr = Dijkstra_GetShortestUnvisited();

  while( tempAddr )
  {
    routeIndex[tempAddr-1].visited = 1;
    alt = routeIndex[tempAddr-1].pathLength;

    //relax every neighbor of the closest unvisited node
    for( x = 0; x < routeIndex[tempAddr-1].neighborCount; x++ )
    {
      if( NotInRoutingTable( routeIndex[tempAddr-1].linkStates[x].destAddr ))
        continue;

      neighborAddr = routeIndex[tempAddr-1].linkStates[x].destAddr;
      if( (alt + routeIndex[tempAddr-1].linkStates[x].distanceMetric) < routeIndex[neighborAddr-1].pathLength )
      {
        routeIndex[neighborAddr-1].pathLength = alt + routeIndex[tempAddr-1].linkStates[x].distanceMetric;
        routeIndex[neighborAddr-1].prevNode = tempAddr;
      }
    }
    tempAddr = Dijkstra_GetShortestUnvisited();
  }
}

int8 Dijkstra_GetShortestUnvisited()
{
  int8 x;
  int8 curShortest = 255;
  int8 curShortestAddr = 0;

  for( x = 0; x < MAX_DEVICES_ON_NETWORK-1; x++ )
  {
    if( !routeIndex[x].devAddr ) //skip empty entries
      continue;

    if( !routeIndex[x].visited && routeIndex[x].pathLength
        && routeIndex[x].pathLength < curShortest )
    {
      curShortest = routeIndex[x].pathLength;
      curShortestAddr = routeIndex[x].devAddr;
    }
  }

  if( curShortest == 255 )
    return 0;

  return curShortestAddr;
}

void FindFirstHop( int8 addr )
{
  unsigned int8 prevNode;
  unsigned int8 prev;

  prevNode = routeIndex[addr-1].prevNode;

  if( prevNode == GetMyAddress() ) //direct neighbor
  {
    routeIndex[addr-1].firstHop = addr;
    return;
  }

  //walk the prevNode chain back until the hop after us is found
  while( prevNode )
  {
    prev = routeIndex[prevNode-1].prevNode;

    if( prev == GetMyAddress() )
    {
      routeIndex[addr-1].firstHop = prevNode;
      return;
    }
    prevNode = prev;
  }
}

This is a custom implementation of Dijkstra’s algorithm in actual use for a packet routing system in a network of up to 24 devices.  The performance has been very good and I believe it could easily scale to over a hundred devices (given enough RAM) on the 16MHz PICs.  The actual routing implementation includes a hop count field to limit the effect of a loop.  A loop could form if a device is removed from the network while the entire network is not yet in sync, or if there was a radio signal strength anomaly.  There are probably better ways to handle that, but our requirements have addition/removal of devices being rare.

Top – Linux System Metrics: Statistics And Interpretation


An important part of running Linux servers is to be able to identify and interpret the performance metrics the operating system provides.  It is usually easy (but not always) to identify performance bottlenecks.  Many a system administrator has gotten the monitoring alert that things are wrong.  With a bit of experience it should be fairly easy to identify, fix, and prevent the problem before the owner or stakeholder gets notified by a downstream alert or angry client.


‘top’ is probably the most used utility for this task.  Top shows the time, uptime, users, load average, task info, cpu info, memory info, and individual process info.  99% of the time, in my experience, this data will help lead to the issues.  Here is an example ‘top’ header output from a low power virtual machine:

top – 02:57:33 up 37 days, 9:52, 3 users, load average: 0.00, 0.00, 0.00
Tasks: 80 total, 1 running, 79 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 534332k total, 456008k used, 78324k free, 4236k buffers
Swap: 1048568k total, 132972k used, 915596k free, 169448k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    1 root      40   0 23588  964  536 S    0  0.2  3:38.65 init
    2 root      40   0     0    0    0 S    0  0.0  0:00.00 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0  0:00.06 migration/0
    4 root      20   0     0    0    0 S    0  0.0  0:00.00 ksoftirqd/0
    5 root      20   0     0    0    0 S    0  0.0  0:00.07 events/0
    6 root      20   0     0    0    0 S    0  0.0  0:00.00 khelper

Load Average

Load average is usually a very trustworthy metric on how the general system is doing.  I say usually because I have experienced situations where the load average belies system health and I will cover that.  If your system must respond in real-time you do not want it to get backlogged with CPU demand.  If you do not need real time response, based on experience, your load should not exceed 4-8x the number of cores for any appreciable amount of time.  At some point the CPU demand coupled with the underlying kernel requirements for context switches and interrupts could cascade out of control and just about freeze the system.

A load average metric represents the number of processes demanding CPU time over a given time frame.  If one program demands 100% CPU time for 1 minute, the 1 minute load average is 1.00.  If one program uses 100% of the CPU for 0.1 minutes, the load average is 0.10 over that 1 minute period.  The three load averages given by ‘top’ are for 1, 5, and 15 minute intervals.  As a general rule of thumb you do not want your load average to exceed the number of CPU cores.  Here are some examples:

  1. load average: 1.00, 0.20, 0.06 – computer generally idle, 1 program at 100% for 1 minute; should be ok for a 1+ core system
  2. load average: 0.00, 0.50, 2.50 – computer idle for the last minute, a bit of load over the last 5 minutes, something intensive over the last 15 minutes; questionable for a 2+ core system
  3. load average: 2.15, 3.20, 1.50 – constant load; problematic for a single core system, pushing limits for a 2 core system, probably ok for a 3+ core system
  4. load average: 6.25, 9.76, 8.44 – serious load; could see major delays for anything less than 8 cores, I’d recommend 16 cores for this load

Exception: the one instance I have seen a good system yet crazy load averages ( e.g. 45.00, 56.00, 47.00 – quad core) has been running Tomcat servers.  They appear to have processes that demand CPU time but do not actually use the CPU for some reason.  The system was very responsive (low cpu usage) yet according to the load should have been hung up.  If the load points to problems it is always good to make a web, db, or other application request to confirm things are slower than they should be.  A slow shell generally points to lagging performance.


Task Statistics

The task statistics show the total, active, sleeping, stopped, and zombie processes.  The total is simply the total of processes loaded into memory.  Active processes are the current number of processes that are running.  Sleeping processes are those that are active but ‘blocking’ at the system level waiting for a reason to run.  Stopped processes are active processes that have been ‘paused’ via a signal.  Zombie processes are those which have exited but the parent process has not acknowledged that exit.  Zombie processes used to be quite common years ago but nowadays rarely show up in the most used applications.  I view any zombies as a sign of problems unless dealing with an application that routinely creates them, which is very rare.

Generally, unless there is runaway spawning of programs, the number of tasks will be of little concern.  Each process requires a small amount of CPU time to check whether it needs to be run.  On modern processors even several hundred programs will require a fairly minuscule amount of management overhead.  If for some reason there are many thousands of processes running, this could begin to take a noticeable amount of CPU to manage.  There could also be significant CPU usage if there are many hundreds of programs waking up many hundreds of times per second, even if they go almost directly back to sleep.

CPU Statistics

The CPU usage statistics are broken up into eight categories.

  • us – user:  User space programs
  • sy – system:  Kernel usage
  • ni – nice: User space programs running at the lowest priority, any other priority levels would fully preempt these
  • id – idle: Percent of cycles not being used
  • wa – wait: Percent of cycles not used due to IO waits, e.g. Hard drive access
  • hi – hardware interrupt: Percent of cycles spent in hardware interrupt code
  • si – software interrupt: Percent of cycles spent in software interrupt code
  • st – stolen time: Percent of cycles demanded given to other virtual machines (when your system is a virtual machine)

If you have a single core or are not listing all CPUs these percentages will equate to total system capability. Pressing the ‘1’ key will expand the CPU list to show the stats for each individual CPU. In combined mode on a quad core system, 25% CPU usage would equate to a single core at full usage. In expanded mode the single CPU would show as 100%.

‘us’ or user time is generally the one to keep an eye on.  If you are running at full capacity you may need to upgrade or spread the workload.  If things are slow and only the ‘wa’ (wait) value is high, the bottleneck is IO (network, disk, etc).  If ‘st’ is high, the VM host may be too busy; think about spreading out your VMs.  ‘sy’ (system) and ‘hi’ (hardware interrupts) may be higher than normal during heavy network loads.


Memory Statistics

The memory statistics show the system-wide usage of your available memory.  The metrics are ‘total, used, free, buffers, cached’.  ‘Cached’ shows up on the swap line but is not related to swap.

  • Total – The total amount of RAM available to the system
  • Used – The total amount of RAM allocated by the system.  This includes ‘buffers’ and ‘cached’.
  • Free – The total amount of pure unused RAM in the system.
  • Buffers – The total amount of RAM used for critical system buffers
  • Cached – The total amount of cached data.  This data will be free’d whenever there is not enough ‘free’ RAM in the system.

The major thing to keep in mind is that ‘cached’ data is almost as good as ‘free’ data.  If you have 8GB of RAM, 7.5GB used, 7GB in cache, and 0.5GB free, you effectively have 7.5GB of available RAM.  The cache fills up with accessed disk blocks and other data to make future access faster, but it is given up the moment more RAM is needed.  The more that is cached the better disk performance will be, but you can’t just go by the ‘free’ value.

Swap should generally not be actively used if a system requires real time performance.  Swap pushes out memory blocks to storage devices to make more free RAM.  Generally storage devices are very slow compared to RAM and constant use will affect system performance heavily compared to physical RAM.

Process Statistics

The process statistics show stats for the individual running processes: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

  • PID – the process ID.  This can be important to know when you want to send signals to a process (kill for example)
  • User – this is the username or user id of the user running the process
  • PR – this is the priority of the process
  • NI – this is the ‘nice’ level of the process
  • VIRT – this shows the ‘total’ ram accessible by the process.  It may be very large because it includes all mapped libraries, mapped buffers (e.g. video), and shared memory.
  • RES – resident usage.  This number is the physical RAM total of the program itself and the used library components, and is directly used for the %MEM field.
  • SHR – shared usage.  This represents the RAM that is actually shareable.  A mapped library will show up under VIRT and SHR while only the components in use will be added to RES.
  • S – state.  This can be ‘D’ (uninterruptible – e.g. disk access), ‘R’ (running), ‘S’ (sleeping), ‘T’ (being traced), or ‘Z’ (zombie)
  • %CPU, %MEM – show the current instant CPU and RAM usage respectively.
  • Time+ – the total amount of CPU time used since the process started.  On a quad core machine a constantly running 4 thread program can use almost 4 seconds of CPU time per second.
  • Command – the name of the command that was run.


‘top’ is one of the most important tools in the Linux system administrator’s toolbox.  Learning to interpret the output is key to successfully troubleshooting problems and identifying system and process health.

System Uptime Personal Record: 4.75 Years And Growing (Debian Linux)


*update* currently 5.2 years (04:40:08 up 1900 days, 15:05,  1 user,  load average: 0.00, 0.00, 0.00)

I’ve worked for many years at a small company that does a lot of things, one of them being hosting.  I have the opportunity to run and maintain dozens of machines for various private sector contracts.  My personal choice would be FreeBSD for a bare metal OS but, since it is fairly hard to find BSD admins compared to Linux admins (read: pay grade), I have to run Linux in case I ever leave.  Windows has gotten better over the years, yet Linux and especially BSD have always been renowned for their ‘rock solid’ behavior.

We have one server in a group that thus far has not seen a reboot since it went live.  This Debian-based machine is firewalled off except for web services, Apache got needed security updates, and we started with an older stable kernel that has yet to see a critical security update that applies to the network environment.  The contract has expired and the machine is not currently used (gotta love bureaucracies).  I’ve taken screenshots of the uptime for myself over the years… at 2, 3, 4 and now 4.75 years.

 22:17:35 up 1737 days,  9:43,  1 user,  load average: 0.04, 0.08, 0.07

This is a testament to Linux and open-source.  I would not be surprised if Windows could match this now, but it still impresses me.  Below I have posted two nearly famous images depicting the call stack of a web request for Linux-based Apache and for Windows IIS.  By itself this does not mean much, but history has generally shown that the more complex software is, the more prone it is to problems.  Though over half a decade old, these images just about speak for themselves.  Efficient software can be made proprietarily, I am sure, but from my experience that does not happen unless it is a serious requirement.  Windows is more of a pumped out ‘make it work’ system, whereas large projects in the open-source arena benefit from less bureaucracy, more eyes on the code, and more work towards (and acceptance of) elegant solutions.

These images are unfortunately not full resolution.

Thanks Linux and the open-source community.  I will continue to update this until the server is rebooted or decommissioned.

The XOR Linked List In C++


It is possible to store the pointers in a linked list in such a way that they ‘overlap’ by using the logical XOR operator.  This allows a user to cut the pointer memory footprint of a basic doubly linked list by 50% at the expense of a couple extra CPU operations.  In today’s world this technique would generally not be used on a desktop or server system (barring extreme requirements).  It is more suited towards embedded development on tiny microprocessors that may have a mere few kilobytes of precious RAM, where the XOR linked list may be the most memory efficient way to represent the data.

This implementation is based on the writeup on Wikipedia.  I have created a basic linked list class that supports inserting at the tail and removing from the head.  Intended usage for this demo is non-duplicate pointers.

XOR is a very interesting logical operator that can be difficult to understand.  There is a bitwise operator refresher located here.  The basic rules are:

  • 0 ^ 0 = 0
  • A ^ 0 = A
  • A ^ A = 0
  • A ^ B = B ^ A
  • (A ^ B) ^ A = B
  • (B ^ A) ^ B = A
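These identities are easy to verify directly in code; a quick sanity check of my own:

```c
#include <stdint.h>

/* verify the XOR identities used by the list, for two pointer-sized values */
int XorRulesHold(uintptr_t a, uintptr_t b)
{
    uintptr_t combined = a ^ b;        /* A ^ B           */
    return ((a ^ 0) == a)              /* A ^ 0 = A       */
        && ((a ^ a) == 0)              /* A ^ A = 0       */
        && (combined == (b ^ a))       /* A ^ B = B ^ A   */
        && ((combined ^ a) == b)       /* (A ^ B) ^ A = B */
        && ((combined ^ b) == a);      /* (B ^ A) ^ B = A */
}
```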

When we apply this, instead of having a ‘next’ and ‘prev’ like a normal linked list, we have one ‘NextPrev’ that holds the XOR of both the ‘next’ and ‘prev’.

  • NextPrev = Next ^ Prev
  • Next = NextPrev ^ Prev
  • Prev = NextPrev ^ Next

Based on the rules above, both head ( with Next as 0 ) and tail (with Prev as 0) maintain the actual pointer to the next object because XOR with zero leaves the original bits.  As a side effect the head and tail pointers are the only entry points.  If there is a pointer to an object in the middle of the list, there is no way to traverse from that object without knowing the actual next or previous item.  Another caveat is that debuggers will not be able to recognize the pointers in the list other than ‘head->prev’ and ‘tail->next’.
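To make the traversal caveat concrete, here is a minimal sketch of my own (using a simplified stand-in node type, not the class in this post) that walks a three-node XOR list from the head.  It only works because the previous pointer is carried along at every step.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct XorNode { int id; uintptr_t nextPrev; } XorNode;

/* link three nodes a <-> b <-> c: head and tail hold plain pointers,
   the middle node holds prev ^ next */
void LinkThree(XorNode *a, XorNode *b, XorNode *c)
{
    a->nextPrev = (uintptr_t)b;
    b->nextPrev = (uintptr_t)a ^ (uintptr_t)c;
    c->nextPrev = (uintptr_t)b;
}

/* walk head to tail, carrying the previous pointer so each next
   pointer can be recovered as nextPrev ^ prev; returns node count */
int WalkXorList(XorNode *head, int *idsOut)
{
    XorNode *prev = NULL, *curr = head;
    int count = 0;
    while (curr) {
        idsOut[count++] = curr->id;
        XorNode *next = (XorNode *)(curr->nextPrev ^ (uintptr_t)prev);
        prev = curr;
        curr = next;
    }
    return count;
}
```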


This is the header file for the list.  The behavior of the list is:

  • If there are no entries head and tail are 0
  • If there is one entry, head is the entry and tail is 0
  • If there are 2 or more entries, head and tail are both set, as are the items’ m_NextPrev variables that hold both next and previous



#include <cstdint>

template <class T>
class DoubleLinkedXOR
{
public:
	DoubleLinkedXOR() { m_head = m_tail = 0; }

	void AddToTail( T entry )
	{
		if( !m_head ) //no items
		{
			entry->m_NextPrev = 0;
			m_head = entry;
		}
		else if( !m_tail ) //just head is set, one item in list
		{
			m_tail = entry;
			m_tail->m_NextPrev = m_head;
			m_head->m_NextPrev = m_tail;
		}
		else //two items or more, insert at tail
		{
			entry->m_NextPrev = m_tail;
			m_tail->m_NextPrev = (T)((uintptr_t)(m_tail->m_NextPrev) ^ (uintptr_t)entry);
			m_tail = entry;
		}
	}

	T RemoveFromHead()
	{
		T item = m_head;

		if( !m_head ) //empty list
			return 0;

		if( !m_tail ) //just one item
		{
			m_head = 0;
			return item;
		}

		if( m_head->m_NextPrev == m_tail
			&& m_tail->m_NextPrev == m_head ) //only two items
		{
			m_head = m_tail;
			m_tail = 0;
			m_head->m_NextPrev = 0;
			return item;
		}

		//three or more items: the old head held a plain pointer to the next item
		m_head = m_head->m_NextPrev;
		m_head->m_NextPrev = (T)((uintptr_t)(m_head->m_NextPrev) ^ (uintptr_t)item);
		return item;
	}

private:
	T m_tail;
	T m_head;
};

The only general requirement of the above template is that the type instantiated is a pointer to an object with an m_NextPrev member of the same type.  (Note the pointer XOR is done through uintptr_t so the casts are safe on 64 bit systems.)  Below is the test object used in the example.


#ifndef __ITEM_HEADER__
#define __ITEM_HEADER__

typedef struct _ITEM {
	int id;
	struct _ITEM *m_NextPrev;
} Item;

#endif //__ITEM_HEADER__


Below is the simple test app I whipped up as a proof of concept.  It creates a number of items and removes them as a FIFO.


#include <stdio.h>
#include <stdlib.h>

#include "Item.h"
#include "DoubleLinkedXOR.h"

int main( int argc, char *argv[] )
{
	DoubleLinkedXOR<Item*> *list = new DoubleLinkedXOR<Item*>();

	int x, count = 10;

	for( x = 0; x < count; x++ ) //add all items to list
	{
		Item *newItem = (Item*)malloc(sizeof(Item));
		newItem->id = x;
		list->AddToTail( newItem );
		printf("Added item: %d\n\r", newItem->id);
	}

	for( x = 0; x < count; x++ ) //remove all items, FIFO order
	{
		Item *retrievedItem = list->RemoveFromHead();
		printf("Retrieved item: %d\n\r", retrievedItem->id);
		free( retrievedItem );
	}

	delete list;

	return 0;
}
And my output looks like this when count = 10:

Added item: 0
Added item: 1
Added item: 2
Added item: 3
Added item: 4
Added item: 5
Added item: 6
Added item: 7
Added item: 8
Added item: 9
Retrieved item: 0
Retrieved item: 1
Retrieved item: 2
Retrieved item: 3
Retrieved item: 4
Retrieved item: 5
Retrieved item: 6
Retrieved item: 7
Retrieved item: 8
Retrieved item: 9

The files are here.

Endianness: Big Endian vs. Little Endian; Shifting and Casting Examples


I ran into a bug recently where I was trying to extract the lower 32 bits from a 64 bit pointer referencing two contiguous 32 bit variables in memory.   They were taken out opposite what was expected due to my confusion with the way data is stored.  There is a fairly good writeup on Wikipedia but not very many examples.  I also highly recommend reading this article about writing Endian-independent-code.

In general, endianness only concerns the byte order of multi-byte CPU built-in types (int16, int32, int64, etc).  The bits within each byte are treated the same by most systems; bit endianness is much rarer.

Endian Detection

From what I understand it is safest to detect endianness via a union instead of a cast since the latter may not be supported in all implementations. Here is an example of endian detection:

#include <stdint.h>

int IsLittleEndian()
{
  union {
    uint32_t i;
    uint8_t c[4];
  } endianCheck = { 0x04030201 };

  return endianCheck.c[0] == 1;
}

To explain, there is one 32 bit memory location that can be accessed as a 32 bit variable or an array of 4 characters.  If the memory address for the 32 bit integer (4 bytes) is X, the first char array byte c[0] is at X as well, with c[1] being at memory location X+1 byte, c[2] at X+2 bytes, and c[3] being at X+3.

The secret is the bytes are reversed depending on the endianness.  From the detection function, 0x04030201 looks like this:

Memory Location   Little Endian   Big Endian
X                 0x01            0x04
X+1               0x02            0x03
X+2               0x03            0x02
X+3               0x04            0x01

From the table you can tell that the byte order is reversed and that for code to be portable you would have to take into account the individual byte positions if portions of a multi-byte primitive are to be modified.

Effect on Shifting and Casting

Explicit shifting and casting to convert data is one of my favorite techniques, even though casting mistakes can be dangerous.  One subtlety: shifts and plain value casts operate on the value of a variable, so they behave the same on either endianness; it is reading individual bytes through a cast pointer that depends on the byte order.  Given the 32 bit value from the detection function, 0x04030201:

uint32_t var = 0x04030201;
uint8_t *p = (uint8_t *)&var; //view the 4 bytes in memory order

//extract the most significant byte (0x04)
uint8_t msb = (uint8_t)(var >> 24); //portable on either endianness
//via the pointer: p[3] on little endian, p[0] on big endian

//extract the least significant byte (0x01)
uint8_t lsb = (uint8_t)var; //portable
//via the pointer: p[0] on little endian, p[3] on big endian

//extract the 2nd least significant byte (0x02)
uint8_t byte1 = (uint8_t)(var >> 8); //portable
//via the pointer: p[1] on little endian, p[2] on big endian

The pointer cast forces a single byte to be read at a given offset from the 32 bit variable’s address.  From the table above you can tell how that differs depending on the endianness.  A brushup on bit shifting is located here.

Masking is not affected because the mask and the variable are both interpreted as values; like shifting, masking operates on the value rather than on memory.
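A quick illustration of that point (my own minimal example): the same mask-and-shift extracts the same byte no matter how the value is laid out in memory.

```c
#include <stdint.h>

/* extract the 2nd least significant byte by masking then shifting;
   endian-independent because it operates on the value, not on memory */
uint8_t MaskSecondByte(uint32_t var)
{
    return (uint8_t)((var & 0x0000FF00) >> 8);
}
```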


Endianness, although it can be confusing, is a simple concept fundamental to how data is laid out in memory.  It helps to understand the basics and to confirm code behavior with a good debugger.

OpenVPN And PPP on Linux: VPN Traffic Forwarding Default Gateway Fix


I use OpenVPN all the time and had the task of setting it up to function as the default gateway for an embedded linux machine that is using PPP with a USB 3G dongle.  The VPN connection worked great as always but for some reason I couldn’t get the traffic forwarded over the 3G line.  It worked fine when connected to an Ethernet line which was the first clue.

This fix assumes a working OpenVPN setup with broken traffic forwarding functionality via PPP.

PPP Anomaly

For some reason when PPP connects and is the sole outbound gateway, the routing table works but the PPP default route has only the UH flags and no ‘G’ flag (for gateway).  When OpenVPN attempts to change the default gateway it is unable to detect the PPP device due to the missing flag.

The Fix

I need to find the source for this.  I found it a couple years ago and it has served me well, but I do not know the author.  It is a script to run after the PPP daemon comes up that re-adds the default route so it carries the ‘G’ flag in the routing table.  I have mine named 00000routefix and it lives in /etc/ppp/ip-up.d with executable permissions (auto-run when PPP comes up).


#!/bin/sh
if [ "$(ip route list exact default |
  awk '/^default/ {print $2}')" = "dev" ]; then
         IF=$(ip route | awk '/^default/ {print $3}')
         GW=$(ip address show $IF |
         awk '/peer/ {print $4}' | cut -d"/" -f1)
         ip route replace default via $GW dev $IF
fi

The above works great for me and allows all traffic on the remote box to be securely forwarded and routed through our primary network.

Note:  The above has been tested on Debian and may require tweaks depending on command output differences.

The Difference Between Blocking, Lock-Free, and Wait-Free Queues


I recently grew an interest in lock free queues and during my research noticed some confusion concerning blocking, lock-free, and wait-free queues.  There are quite a few claimed implementations that do not meet the requirements.  This is my attempt to alleviate some confusion.  I define an interrupt safe queue as one that will succeed at some point within a single queue or dequeue call from a uniprocessor interrupt function.

Blocking Queues

A blocking queue is one where an active thread can lock the queue, and attempted locks by other threads ‘block’, yielding their scheduled CPU time back to the system.  Blocking queues can be inefficient on multi-processor systems where it takes longer for a blocked thread to be rescheduled than it would have taken to ‘spin for success’, i.e. retry the operation until it succeeds.  That is the basis for spin locks and lock-free queues.
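As a minimal sketch of the blocking behavior using pthreads (my own example, not a full queue implementation):

```c
#include <pthread.h>

static pthread_mutex_t qLock = PTHREAD_MUTEX_INITIALIZER;
static int itemCount = 0; /* stand-in for real queue state */

/* a blocking enqueue: if another thread holds the lock, this thread
   sleeps (yields its time-slice) until the lock becomes available */
void BlockingEnqueue(void)
{
    pthread_mutex_lock(&qLock);
    itemCount++; /* ...a real queue would link an item in here... */
    pthread_mutex_unlock(&qLock);
}
```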

Conversely, a non-blocking queue is one where a thread will never voluntarily give up its allotted time-slice.  Lock-free and wait-free queues are both non-blocking.

Lock-Free Queues

A lock-free queue implementation guarantees system-wide progress but does not guarantee per-thread progress.  Lock-free queues often use atomic compare-exchange primitives: if the atomic call fails, another thread has already made a change and it is time to retry, or ‘spin’, until it works.  Threads may ‘starve’ because there is no per-thread guarantee of progress; for example, a thread may keep retrying and, in the worst case under the heaviest loads, be continually delayed by other higher priority threads.
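The retry loop described above can be sketched with C11 atomics; a minimal example of the compare-exchange pattern on a counter rather than a full queue:

```c
#include <stdatomic.h>

/* lock-free increment: if another thread changed the value first, the
   compare-exchange fails, 'expected' is reloaded with the current
   value, and we retry ("spin") until our update lands */
void LockFreeIncrement(_Atomic int *value)
{
    int expected = atomic_load(value);
    while (!atomic_compare_exchange_weak(value, &expected, expected + 1)) {
        /* expected now holds the latest value; loop and retry */
    }
}
```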

Wait-Free Queues

A wait-free queue is a lock-free queue with guaranteed per-thread progress: every operation is guaranteed to complete within a maximum number of steps.  It appears a true wait-free algorithm would also be interrupt safe.  The primary difference is that the delays possible under a lock-free approach are bounded, if not removed, by a wait-free one.

Interrupt-Safe Queues

In a uniprocessor system an interrupt function can quite literally interrupt the code flow anywhere.  This requires a fairly complex algorithm that lets the interrupt function succeed in either a queue or dequeue no matter the current state of the other threads.  This requirement may all but completely fade as the future appears to be parallel processing… yet at the moment I still work with uniprocessor systems!


Interest in queue theory is growing because of parallel processing.  Most commercial code right now is sufficient for the single core or handful of cores that comprise most basic consumer systems.  Most algorithms currently in use are extremely inefficient when scaled to dozens, hundreds, or thousands of cores.

I suspect the holy grail of wait-free queues will be extremely slow compared to others until actually utilizing a large number of processors.


Queue theory is big right now.  We have basically witnessed the advent of generic multi-core hardware within the last few years and are already facing huge scaling problems as we head into the future.  The solutions will be very interesting.

Starting a blog

Finally starting a blog after all these years to keep track of all the cool stuff I usually end up misplacing.