The dreaded two's complement
14 May 2012Two’s complement math is like an odometer where you use the upper half of the values to represent negative numbers. In decimal terms if you had six digits – i.e. 999,999 was the highest the odometer went before rolling back over to 000,000 then you would simply treat the top half – 500,000 to 999,999 as negative values. 999,999 is one less that 000,000 so it would represent negative one. Likewise 999,9998 would represent negative two. In binary this representation is usually called the “two’s complement” because the easiest way to calculate it is to take the “one’s complement” (the straight up complement of each bit – i.e. change every zero to one and every one to zero) and then add one. The problem with the straight up ones complement is that you end up with two representations for zero – not only is that wasteful, but it can get kind of confusing. I’ll admit the two’s complement can get a bit confusing as well, but it has a number of mathematical conveniences – namely addition, subtraction and multiplication just happen to work the same as if you were dealing with unsigned values (with certain caveats).
So why, you ask, am I going on about this thing called the two’s complement? Because it confuses me a bit. I get it, I like it, it’s fascinating, and yet sometimes I have a hard time wrapping my brain around all the implications and usually I just choose to take for granted the fact that it works and let it go. This is one of those times when it’s getting in the way though and I just have to suck it up and take a little time to grok the concept better. I intend to play around with bits (sometimes referred to as bit twiddling) and that can get a bit dangerous when you’re dealing with signed two’s complement integer values. Unfortunately, Java (and I imagine most higher level languages at this point) don’t really give you a way for specifying that you’d rather work with unsigned integer values, so signed values it is!
Signed values (and the underlying representation of data in general) are one of those things that programmers tend not to bother thinking much about these days. I feel like this should be pretty straight forward stuff to anybody with a Computer Science degree – but then again maybe it’s something that’s introduced in CS101 and many students never really revisit. Modern programming languages work at such a high level of abstraction that this the kind of magic that only Electrical Engineers who design the hardware chips that the programs sit a couple levels above really have to worry about. I didn’t study Computer Science or Electrical Engineering though – I studied Physics and German, so for me (and I imagine much of the software development community), computers were just kind of an afterthought.
I enjoy math and theory though, so I’ve always hated not being more comfortable with this whole two’s complement thing. Now seems like a great time to get up to speed… In fact, you could even say I partly chose to work on this project because I was interested in getting more comfortable with a few lower level concepts like bitwise operations and the two’s complement.
##Running around in Circles In the words of Burt Bacharach, “The world is a circle without a beginning, and nobody knows where it ends.” Pretty wise words, but then again nobody these days (except maybe Thomas Friedman) thinks the world is Flat.
In my last post I talked about how I thought it made sense to use a binary odometer to keep track of Latitude and Longitude. I just didn’t know what to call it at the time. A few days later I realized where I was going wrong in my google searches – “binary representation latitude longitude” was actually two specific. I eventually switched over to searching specifically for geometric/trigonometric terms for binary represenation of angles and degrees. I wanted something like radians, but without the whole 3.14… Pi part. Apparently the term I was looking for is called __Binary Radians__ (brads) which belong to the __Binary Angular Measurement System__ (BAMS – take that Emeril!).
Sure, a rose by any other name would smell just as sweet, but sometimes it’s nice to have a name to call what you’re thinking about, and I prefer to use the same name that the rest of the world is already using rather than proliferate redundant terms (PCA, LDA, PLSA and a few others are really just specific versions of the more general SVD sans any attribution…) So Binary Radians it is – that’s what I intend to use to store Latitudes and Longitudes, and most likely in an interleaved format. I do have some reservations about this whole two’s complement signing business though. I think in many cases, it just works, but I’m not entirely sure if that’s only because the world really is conveniently a circle and the odometer metaphor is accurate.
Let’s go back to what I was saying last time about the circular nature of degrees. To make things easier for most people, I’ll stick with speaking in the conventional degrees for now – i.e. 360 degrees in a circle. Degree’s are periodic with a period of 360 – that means 360 and 720 are all the same as 0 and 361 is the same as 1. It also means 361 is the same as 1 and 359 is the same as -1. Maybe you can start to see where this is exactly like the two’s complement odometer representation I was describing above. 180 is the same as -180, 181 is the same as -179, etc. Normally when you use unsigned integer values, 00000000 (all zeros) is the minimum value you can represent and 11111111 (all ones) is the maximum value. When you switch over to the two’s complement version of signed values, 01111111 (one zero followed by all ones) becomes the maximum value you can represent and 10000000 (one one followed by all zeros) becomes the minimum value. In terms of degrees, unsigned values would be appropriate for representing 0 to 360 (or really 359.9999), but this is really the same as using signed values to represent -180 to +180 (or rather +179.999). The underlying representations are the same whichever way you think about it – it conveniently just works because the world really is a circle. It’s also somewhat convenient that the cannonical form of specifying latitude and longitude just happens to be from -180 to +180. Since I’m forced to use signed two’s complement integers, this all works out rather nicely.
###So what’s the problem?
Yeah, that part. Maybe there is no problem. Most likely I’m the problem – I’m just not entirely comfortable with it yet, and don’t trust myself not to make a mistake. Unit tests help, but they’re not the same as real mathematical proofs – unfortunately I’ve never been that good at writing proofs.
While this system works great for global latitudes and longitudes, I’m not yet certain how well it extends to an interleaved vector like system or to local zones that aren’t circular. By interleaved vector like system, I’m referring to the idea of interleaving the binary radian values of latitude and longitude to create a scalar representation of something that’s really a two dimensional vector. Once again, if you’re interested in the interleave concept, look up space filling curves, morton numbers of z-shaped curves. The general idea is to reduce the number of dimensions from two to one in such a way that two points which are close together in the two dimensions will be close together with greater probability in the one dimension and two points which are farther apart in the two dimensions will be farther apart with greater probability in the one dimension. Hilbert curves work the best, but Z-shaped curves are easier to implement and work better than straight up projections.
Even Hilbert curves still have the problem with edge conditions (two points on the opposite side of extreme edges may be close together but appear to be extremely far apart in the one dimensional representation) but that’s what I’m hoping to address with my “add one third” solution. True, it means using three one dimensional representations, but that really is still one dimension smaller smaller than one two dimensional representation.
So suppose we have this nice interleaved scalar representation of a two dimensional vector. If each of the dimensions of the vector come from a signed 32 bit integer value, then the interleaved scalar will be a 64 bit “long” value. The 64 bit value will likewise be signed and treated as a two’s complement – part of me still has reservations that this could cause problems when I perform certain operations on the value. It would be nice to have utility functions which allow me to work with these scalars as if they still were the underlying vector values – suppose I’d like to add or subtract two vectors for instance (i.e. adding a third like I’ve described before). Vector addition should still be possible short of de-interleaving the values by peforming a little creative masking and adding the latitude and longitude separately, then putting the values back together again.
My initial reservation was that the first bit of the mask would have to account for sign by looking at the value of the second bit in the case where I was masking the first bit (all odd bits rather than even ones). I think I’ve convince myself that this won’t be an issue simply by accepting that two’s complement signed math operations are consistent with unsigned operations and in this circular world, the underlying values are the same whether we’re dealing with signed or unsigned values.
Unfortunately this proof doesn’t work when I try to extend the same rules to non circular local chunks of the world. This entire system is largely motivated by the need for a convenient way to “divide and conquer”. The world’s a big place and all too often when mining geolocational data, locality is important. While a butterfly flapping it’s wing’s in Brazil may or not eventually set off a chain of reactions that leads to a sandstorm in the Sahara Desert, we’re usually more interested in more local interactions. When looking at data in Chicago, it’s nice to limit yourself to data in and around Chicago – not data on the other side of the world. A tiling system that lets you easily divide the world into discrete chunks and group the data within those chunks into independent buckets would be nice for mining geolocational data. Further, it would be nice if you could use the same operations to work with those tiles or chunks regardless of how big they were, or where they were. I’ll admit, that’s been the real goal in the back of my mind that motivated this whole project in the first place.
In such a system, I picture local “square” chunks or tiles, which can be treated much like the square map of the world I presented in my last post. This is where I have to start worrying about the odometer metaphor though – unlike the square map of the world I presented before, these local chunks do not follow the circular pacman like geometry that walking off one edge of the map will bring you back to the opposite edge of the map. Instead walking off one edge of a local tile will transport you over to an all together different tile. __Local tile operations will have to account for overflow.__ The solution may be as simple as reserving a bit for overflow – I’m not sure yet, but I have a feeling I’ll need to get a nice drawing put together to make myself feel more confident of whatever solution I do come up with.
Any local tiling system would require an offset – where is the local tile located relative to the greater global system – as well as a description of size (unless the size was fixed which would be pretty inconvenient since the whole point is to handle arbitrary sized chunks of local data…) Initially I had been picturing the offset pointing to the lower left hand corner of the tile. This is largely because I had been thinking of the offset as more of a lexographic prefix then an offset – a local tile would simply be all interleaved latitude/longitude values that shared the same first n bits. That same rule can still apply even if the first n bits are followed by 110000… (pointing to the center of the tile rather than 000000… which would point to the lower left hand corner of the tile). The center point makes more sense however when the local offsets are signed values (positive and negative) rather than unsigned (positive only). It’s also more consistent with a system where points and tiles are really the same thing – a geographic point may have limited accuracy as in we think it’s located at about x, y but we’re not entirely sure and it might be plus or minus n meters. In that case a point is itself a tile with finite size.
A system that represents points and tiles simililarly would be convenient because it would mean one set of operations that could be shared across the board (not that all operations would make sense to use in all situations).
I realize I’ve described more questions than answers, but again, this whole blog is really little more than my personal journal for keeping track of my own thoughts. If it happens to help anybody in their own work, all the better. That’s it for now though. I’ll come back when I have a nice pictogram of how to do it =)