Date: | January 16, 2014 / year-entry #14 |
Tags: | code |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20140116-00/?p=2063 |
Comments: | 62 |
Consider the following batch file which tries to decide whether we are in the first or second half of the calendar year. (Assume US-English locale settings.)
if %DATE:~4,2% LEQ 6 (
echo First half
) else (
echo Second Half
)
This works great, except that it reports that August and September are in the first half of the year. What the heck? Have the laws of mathematics broken down?
Or this JavaScript function that creates a table of known postal codes for Cambridge, Massachusetts.
var CambridgeMA = [ 02138, 02139, 02140, 02141, 02142, 02238 ];
But when you try to use the array, you discover that half of the numbers got corrupted!
alert(CambridgeMA.join(" "));
Are space aliens corrupting my data?
Here's a clue. If you try to calculate the next month in a batch file
set /a NEXTMONTH=%DATE:~4,2%+1
the script generates the following error in August and September:
Invalid number. Numeric constants are either decimal (17), hexadecimal (0x11), or octal (021).
The answer is that pesky leading zero. (August is month 08 and September is month 09.)
Remember octal? I don't. The architectural design of the PDP-8 and other processors of the era made octal a convenient notation for representing values. (This octal-centricness can also be seen in the instruction set of the 8008 processor, which led to the 8080, which led to the 8086, which led to the x86, which led to the x64, and you can still see the octal encoding in the so-called ModR/M and SIB bytes.)
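To make the rule in that error message concrete, here is a rough Python sketch. Python is used only for illustration and is not what cmd.exe runs. The helper names `parse_set_a_number` and `next_month` are hypothetical and merely mimic the three notations the message lists.

```python
def parse_set_a_number(text):
    """Mimic the three notations in the error message: decimal, 0x hex, leading-0 octal."""
    if text.lower().startswith("0x"):
        return int(text[2:], 16)
    if len(text) > 1 and text.startswith("0"):
        # A leading zero selects octal, so the digits 8 and 9 are rejected.
        return int(text, 8)
    return int(text, 10)


# The three spellings from the error message all denote seventeen.
values = [parse_set_a_number(s) for s in ("17", "0x11", "021")]


def next_month(two_digit_month):
    """Return month + 1, or the error text when the field is not a valid constant."""
    try:
        return parse_set_a_number(two_digit_month) + 1
    except ValueError:
        return "Invalid number."
```

Under these assumptions, `next_month("07")` gives 8, while `next_month("08")` and `next_month("09")` hit the error path, which is the August and September failure described above.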
The B language permitted octal constants to be expressed by prefixing them with a zero. For example, 010 represented the decimal value 8.
Nowadays, octal is very rarely used, and as a result, the ability to create an octal constant by inserting a leading zero is now a curse rather than a helpful feature.
Now I can tell a joke: My brokerage firm apparently has difficulty printing decimal values on their statements, because a value like "30.038" ends up being printed as "30. 38". I suspect that their reporting program has a function like this:
void PrintThousandths(unsigned n)
{
printf("%d.%3d", n / 1000, n % 1000);
}
One of my colleagues imagined what the code review must have looked like:
Comments (62)
Yowzer. I found exactly this bug as a 3rd line support contractor working in a delicious multi-tier system. (Java front end, talking CORBA to a C++ server, on a DB2 database). The version of C++ the back end was compiled with used the old specification for atoi… and you know the rest.
Once I realized what was going on it was easy enough to fix… Happy days.
I got quite badly stung by this in JavaScript.
There then followed some evil hacky hacks to try and get round it before finding out that there were proper ways to get round it.
I'm not at all surprised at the js zip code but I am surprised that cmd knows octal. It comes in handy so rarely.
Fun fact: In C-like languages, the literal 0 is in fact octal zero:
int i = 0; // Octal Zero
Not that it makes the slightest bit of difference though.
The ACL groups on VMS were 3 Octal digits (512 total possible groups).
It also doesn't help that apparently every C-inspired language with octal constants still parses numbers in base 10 by default, ignoring leading zeroes:
C: 010 != atoi("010")
Java: 010 != Integer.parseInt("010")
JavaScript: 010 != parseInt("010")
This is why I always explicitly specify base 10 in conversion functions, for clarity.
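The same habit carries over to other languages. As a small sketch in Python 3 (chosen only for illustration; the variable names are made up), passing the base explicitly leaves no room for a leading zero to change the meaning:

```python
# Passing the base explicitly leaves no room for a leading zero to change the meaning.
as_decimal = int("010", 10)
as_octal = int("010", 8)

# Base 0 asks Python to infer the base from the prefix. Python 3 refuses the
# ambiguous leading-zero form instead of guessing.
try:
    int("010", 0)
    guessed = "accepted"
except ValueError:
    guessed = "rejected"

# The 0o prefix is unambiguous, so base 0 accepts it.
explicit_octal = int("0o10", 0)
```

Here `as_decimal` is 10, `as_octal` is 8, and the prefix-guessing call is rejected outright.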
02138 is not a common way of representing a number. If it's a number, it should be represented 2138. Conversely, and the key point, if it's 02138 then it's not a number. A ZIP code is the string "02138". We see leading zeros being disregarded in Excel, but then it is fair for Excel to be biased towards numbers.
The joke aside, the solution to the above problems is to stop storing non-mathematical data in numeric types. Zip codes are textual data. Phone numbers are textual data. Dates and times are dates and/or times, not numeric values. (Yes, yes, I know that in most cases the various date/time types in various languages/frameworks return numeric values for discrete parts, such as months, minutes, or year. Ignore them, particularly in languages/frameworks that have robust date/time handling functions.)
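A minimal Python sketch of why the string representation matters for the Cambridge ZIP codes from the article. The variable names are illustrative only.

```python
cambridge_zips = ["02138", "02139", "02140", "02141", "02142", "02238"]

# Round-tripping through an integer silently drops the leading zero.
round_tripped = [str(int(z, 10)) for z in cambridge_zips]

# If a numeric column is all you have, the zero has to be restored by padding,
# which only works if you already know the intended width.
restored = [str(n).zfill(5) for n in (2138, 2140)]
```

Kept as strings, the values need no repair step at all.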
That's why octals are explicitly prohibited in ECMAScript strict mode – dmitrysoshnikov.com/…/es5-chapter-2-strict-mode
(function f() {
'use strict';
var CambridgeMA = [ 02138, 02139, 02140, 02141, 02142, 02238 ];
alert(CambridgeMA.join(" "));
})()
SyntaxError: Octal literals are not allowed in strict mode.
Visual Studio should generate a warning when it sees a (non-zero) octal constant. That way you can beef it up to ERROR with a #pragma warning ( number : error ) buried in stdafx.h (or windows.h!)
I once worked on the firmware of an embedded system that required entry of an IP address using only up, down and enter. That system used fixed-width entry, so the IP address always required 3 digits for each place, so 192.168.1.1 was entered as 192.168.001.001. We had no idea that the official spec for IP mandated that leading zeros specified octal. That caused no end of confusion for us and our users, as IP addresses on this system could not be directly transferred to a PC because of the octal assumption.
For the Cambridge example, how come only half of the values are getting corrupted? All of those values start with 0.
jader3rd – because the rest have the digit "8" or "9" and thus can't be octal.
@jader3rd, because 8 and 9 aren't octal digits. 02141 is a valid octal number; 02138 isn't. (See the ANSI C edition of K&R: "everybody's favorite trivial change: 8 and 9 are no longer octal digits.")
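The rule in these two replies can be sketched in Python. This is only an approximation of how a non-strict, legacy JavaScript engine reads a leading-zero literal, and `legacy_js_literal` is a hypothetical helper, not a real API.

```python
def legacy_js_literal(token):
    """Approximate how a non-strict legacy JavaScript engine read a leading-zero literal."""
    if len(token) > 1 and token.startswith("0") and token.isdigit():
        if all(ch in "01234567" for ch in token):
            return int(token, 8)   # every digit is octal, so the value is base 8
        return int(token, 10)      # an 8 or 9 forces the decimal reading
    return int(token, 10)


tokens = ["02138", "02139", "02140", "02141", "02142", "02238"]
joined = " ".join(str(legacy_js_literal(t)) for t in tokens)
```

With this rule, `joined` comes out as "2138 2139 1120 1121 1122 2238": the three codes made only of octal digits are the ones that change.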
@Dezgeg:
If you don't mind the fact that GCC will give you names you'll have to "demangle" (that's not quite the right term as it's not name mangling in the usual sense), there's a simpler way to do that:
#include <iostream>
#include <typeinfo>
int main() {
std::cout << 4294967295 << " " << typeid(4294967295).name() << "\n";
std::cout << 0xFFFFFFFF << " " << typeid(0xFFFFFFFF).name() << "\n";
std::cout << 037777777777 << " " << typeid(037777777777).name() << "\n";
}
On MSVC 2010:
4294967295 unsigned long
4294967295 unsigned int
4294967295 unsigned int
@Dezgeg: Another quirk of C/C++ is that unsuffixed decimal constants between 2^31 and 2^32 have a different type depending on the language flavor. In C89, they're 'unsigned int', but in C99/C++11 they're 'long' or 'long long', and they're technically undefined behavior in C++03, assuming 32-bit longs (C++03 §2.13.1/2).
See this question stackoverflow.com/…/why-this-is-undefined-behavior for an interesting example of how this breaking change can introduce silent logic errors when compiling code for C89 vs. C99.
@voo: but C# does not have octal literals.
System.Console.WriteLine(12==012);
produces True.
I got bit with this as well. I ended up doing stuff like the following if there was ever a situation where I had to do math on something that might look like it was Octal:
rem Minutes may have a leading zero, and hence look like an octal constant
rem Extract the Minutes (this works)
for /F "usebackq" %%i in (`echotime /M /N`) do set MINUTES=%%i
rem See if math fails on it. If so, assume octal stupidity is the problem.
set /A MINUTES=%MINUTES%
if ERRORLEVEL 1 set /A MINUTES=%MINUTES:~-1%
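A different workaround that is sometimes used for two-digit fields, and not the one in the comment above, is to put a 1 in front of the digits and subtract 100, so the text being parsed never starts with a zero. Here is a minimal Python sketch of that arithmetic; the helper name `two_digit_field` is hypothetical.

```python
def two_digit_field(text):
    """Read a zero-padded two-digit field without ever parsing a leading zero."""
    if len(text) != 2 or not text.isdigit():
        raise ValueError("expected exactly two digits: %r" % text)
    # "08" becomes "108", which is plain decimal; subtracting 100 recovers 8.
    return int("1" + text) - 100


minutes = [two_digit_field(s) for s in ("00", "07", "08", "09", "59")]
```

The attraction is that it treats every value the same way, rather than special-casing the ones that fail.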
OK, I'm a bit confused here.
We have established in the postal code example that numbers with a leading 0 but WITHOUT an 8 or 9 are considered octal, but numbers with a leading 0 and WITH an 8 or 9 are considered decimal.
In that case, why is the date (1st half/2nd half) example failing? We are treating the bad octals as decimals, exactly like we want to.
They aren't treated as decimal numbers, they are treated as invalid octal numbers. If you wanted them to be treated as decimal numbers, you would have to remove the leading 0.
@James Curran
I would think the behavior on 'bad' octal constants is heavily language dependent. JavaScript vs CMD, in this case.
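For the batch file case, my understanding is that cmd's comparison operators such as LEQ compare numerically only when both operands parse as numbers, and otherwise fall back to comparing the text. Assuming that is right, this Python sketch (all helper names are hypothetical) reproduces the result from the article:

```python
def parse_cmd_number(text):
    """Return the integer cmd would see, or None when the text is not a valid constant."""
    try:
        if text.lower().startswith("0x"):
            return int(text[2:], 16)
        if len(text) > 1 and text.startswith("0"):
            return int(text, 8)      # leading zero means octal; 08 and 09 fail here
        return int(text, 10)
    except ValueError:
        return None


def cmd_leq(left, right):
    """Sketch of IF left LEQ right: numeric when both sides parse, textual otherwise."""
    a, b = parse_cmd_number(left), parse_cmd_number(right)
    if a is not None and b is not None:
        return a <= b
    return left <= right             # simplified stand-in for cmd's string comparison


def half_of_year(month_field):
    return "First half" if cmd_leq(month_field, "6") else "Second Half"
```

With this model, "07" compares numerically and lands in the second half, while "08" and "09" fail to parse, get compared as text against "6", and sort before it.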
@Paul Baker:
Exactly. Programmers / schema designers need to ask themselves this question before they make a field an integer: Will I do arithmetic operations or comparisons against this value? If not then it's a string, not a number.
Social security numbers, phone numbers, zip codes, parcel tracking numbers… none of these should be stored as actual numbers. And as much as I love seeing my phone number printed as 8.0159E9, so many problems could so easily be avoided.
@ch: That's all well and fine, but now please remember how many command-line utils (including ping, traceroute, etc.) were written in C, and used inet_addr for converting the text string to IP-address. And as for why inet_addr recognizes octal and hex numbers — I have no idea.
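To see how far apart the two readings can be, here is a small Python sketch. It follows the comment's description of inet_addr accepting octal and hex fields; `c_style_part` and `dotted_quad` are hypothetical helpers, not the real socket API.

```python
def c_style_part(part):
    """Read one dotted field the way a C base-0 conversion does: 0x hex, leading 0 octal."""
    if part.lower().startswith("0x"):
        return int(part[2:], 16)
    if len(part) > 1 and part.startswith("0"):
        return int(part, 8)
    return int(part, 10)


def dotted_quad(text, read_part):
    """Split a four-field address and convert each field with the given reader."""
    fields = text.split(".")
    if len(fields) != 4:
        raise ValueError("expected four fields: %r" % text)
    return tuple(read_part(f) for f in fields)


padded = "192.168.010.001"
as_entered = dotted_quad(padded, lambda f: int(f, 10))   # what a fixed-width device meant
as_c_reads = dotted_quad(padded, c_style_part)           # what an octal-aware parser sees
```

The decimal reading gives 192.168.10.1 while the octal-aware reading gives 192.168.8.1, so the same text names two different hosts.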
Yeah, echoing what others have said… just because a value consists of numeric digits, doesn't mean it's a number. If it doesn't make sense to (e.g) increment it by one, what you're dealing with probably should be a string.
> [reviewer] Change %03d to %3d, because %03d will print the result in octal.
I bet there was once a time when "printf format specifications" was one of the most popular MSDN pages. Anyways for some reason the joke had me thinking a reviewer might red out the whole statement with a comment about leading %'s representing binary constants. I had to google to see what language I was remembering or if I was on crack. Turns out it was Pascal (at least the implementation I used very early in school).
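The joke turns on the difference between %3d and %03d. Python's % operator follows the same convention as printf for these two specifiers, so the behavior can be sketched like this (function names are made up for the example):

```python
def print_thousandths_buggy(n):
    # %3d pads with spaces, which matches what the brokerage statement shows.
    return "%d.%3d" % (n // 1000, n % 1000)


def print_thousandths_fixed(n):
    # %03d pads with zeros; the 0 is a padding flag and has nothing to do with octal.
    return "%d.%03d" % (n // 1000, n % 1000)
```

For the value 30038 the first version yields "30. 38" and the second yields "30.038".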
The *only* good use for octal constants in this day and age is for the mode parameter of chmod.
I wish that C/C++ had "0d" and "0b" as prefixes akin to "0x" to mean decimal and binary, respectively. Then you could have tables of decimal values with leading zeros for convenience, and the rule about octal wouldn't be so bad. I'd also have "0o" work as a prefix for octal for parity, not that it's really needed.
C++11's custom literal thing only works for suffixes, but it does let you do things like "11011110101011011011111011101111"_b. Unfortunately, Visual Studio 2013 is still behind the other major compilers and doesn't support that.
Fortunately, this can be avoided in a strongly-typed language. But I didn't know about octal myself.
There is an error in: "8008 processor, which led to the 8080, which led to the 8086, which led to the x86 [sic], which led to the x64." "x86" is an error. (All of these are instances of x86.) Should be "i386". Perhaps one day, Mr. Chen, you should write an article on how "x86" by mistake began to mean "pre-x64 only" in Microsoft terminology. (I imagine Microsoft marketing one day started to tag "x86" to products to mean "they run on a diverse range of x86 processors, not just x64" but something went wrong along the way, as with WOW64 and such.)
@Mike, another fun fact: The type of an integer literal in C++ (maybe C too, haven't read that spec recently) can actually change depending on the base of the literal. The difference is that an octal or hex literal will never have type 'signed long' or 'signed long long'.
For example:
#include <iostream>
template<typename T> const char* typeof(T v) { return "unknown"; }
template<> const char* typeof<long>(long v) { return "long"; }
template<> const char* typeof<unsigned>(unsigned v) { return "unsigned"; }
int main()
{
std::cout << 4294967295 << " " << typeof(4294967295) << std::endl;
std::cout << 0xFFFFFFFF << " " << typeof(0xFFFFFFFF) << std::endl;
std::cout << 037777777777 << " " << typeof(037777777777) << std::endl;
}
On AMD64 Linux GCC this prints:
4294967295 long
4294967295 unsigned
4294967295 unsigned
Now, this doesn't affect the literal zero, but it is yet another subtle corner case in C++.
This reminded me of a lightning talk titled "WAT", very well worth watching. One of the funniest talks I've ever seen (one day I want to make it to CodeMash): http://www.destroyallsoftware.com/…/wat
@lucidfox: Just imagine if it didn't – talk about unexpected behavior.
One of the things where Java (and evil tongues would say as a consequence C#..) followed C too closely sadly. It *does* make sense in C to have octal numbers (well not sure we'd really need it these days), but in a high level language you wouldn't/shouldn't set permissions using magic constants anyhow.
One of the things that python removed with the 2->3 transition, btw:
>>> x = 03
File "<stdin>", line 1
x = 03
^
SyntaxError: invalid token
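The same check can be made from inside a script. This is a minimal sketch assuming Python 3; `compiles` is a hypothetical helper that just reports whether the source text is accepted.

```python
def compiles(source):
    """Report whether Python accepts the source text as a statement."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


old_style = compiles("x = 03")     # the rejected spelling from the comment above
new_style = compiles("x = 0o3")    # the explicit octal prefix
value = int("0o3", 0)
```

Under Python 3 the leading-zero spelling is refused and the 0o spelling is accepted, so an accidental octal constant becomes a syntax error rather than a wrong value.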
@SimonRev
Your firmware was right. The official spec for IP has *never* required octal and various documents either implicitly or explicitly forbid it. For example, RFC 790 contains lots of IP addresses written in decimal with leading zeroes. Various other RFCs which contain grammars as part of their protocol definitions permit only four decimal numbers separated by dots (i.e. excluding both octal/hex and the silly notation whereby fewer dots mean that the last number is 16, 24, or 32 bits).
@Myria: In Python, 0o10 == 8. Maybe other languages too.
Fleet Command: well, x86 is not quite a superset of x64. A processor in x64-mode can execute 32bit and 64bit programs, but no longer 16bit. So x64 is an evolutionary step beyond what "x86" (or "80x86") meant for quite a long time before "x64" was a dream :-)
(besides: x64 is incorrect, there never was a processor line "80x64" or so, but intel didn't like that everyone used "amd64" :-))
Engywuck: I thought "x64" was just an abbreviation of "x86-64". MS had to come up with a name for it before AMD officially named it AMD64, so you can't expect the names to match.
Gabe: No the story is opposite, Microsoft came up with the vendor-neutral "x64" later. Parts of Windows still call it "amd64". The problem was AMD's name was too vendor-specific, there was no way Intel was going to call their products "Intel AMD64 processor"
Nicholas: with your criteria, your examples would be stored as numbers. Tracking numbers contain checksums. A certain type of USPS mail is specified as having a barcode "8982 5000 0000 and higher". Social Security numbers also have numeric validation rules. If you're sorting mail, you're sorting by ranges of Zip codes. At the very least, if you have a numeric piece of data, you need to issue them in sequence, such as how telephone numbers are issued in contiguous blocks to companies.
@lucidfox parseInt("010") == 8 in Firefox 20 and earlier, 10 in Firefox 21 and later. See developer.mozilla.org/…/21
I'm a web developer so usually most of your technical posts about Windows development go over my head but I was quite pleased to read your latest article and recognise what the problem was before you explained it for the first time :)
@Engywuck: Etymology is not really my favorite subject. For whatever reason (right or wrong) "x86" now refers to the architecture of the entire range of CPUs from 8086 to Core i7. Both i386 and x64 are part of that. Occasionally, I have confused people disputing statements about x86 variants of Windows Server supporting more than 4 GB of memory because of the same 32-bit-esque technical misconception.
The other day I saw a notice of the 25th birthday of Tcl, and now I remember that it had gotchas involved with 08 and 09…
@Fleet Command: Well as far as I can see "x86" is not authoritative anyhow – Intel at least uses IA-32, respectively Intel 64 for their architectures.
I would agree that "x86" generally refers to the entire architecture, but if accuracy was important, staying far away from those non-official terms seems the best solution.
@Azarien: Mea culpa. Great that the C# team broke this bad design decision!
This is why I always type postal codes as strings. As a side effect it is much easier to extend the system to Canada, where postal codes have letters in them.
@Fleet Command ""x86"is an error. (All of these are instances of x86.) Should be "i386". "
Did you forget about the 80186/8 and the '286? :)
@FleetCommand: 'x86' was coined to represent the 80186/80286/80386/80486 (and briefly the 80586, more correctly Pentium) processor evolution, and as such it was correct IMO.
@Jon:
Yes, some "numbers" contain checksums (for example, credit card numbers). If you want to validate a value with a checksum or do some numeric comparison then you're better off converting the string to a number (using whatever special rules apply for that specific piece of data) and then performing the operation.
Postal codes: international codes (including Canada) often contain letters. Oops.
Tracking numbers: some carrier codes (including UPS) contain letters. Oops.
US Social Security Number: there is some validation you can do, but it is against the individual parts of the SSN. It is a lot simpler to extract the Area Number (first three characters) from string("1234567890") than it is from uint(1234567890).
Jon: As long as "0123456789" are in this order in your locale you can sort your zip codes by string as well as numeric. Perhaps even better as string, since you don't have to "retrofit" leading zeros and can just check for string length – and you won't have to use mathematical operations like "take cubic root of zip code". Same for the UPS barcode etc.
If you want to do checksums you better convert by the checksumming function to "numeric" and back. See for example the old ISBN "numbers" – the checksum is built by doing "modulo 11", and if the result was "10" you'd add an "X". Yes, the character.
Also assuming that zip codes are always numeric *may* work, if you are *sure* you don't plan to go international – and even then there are pitfalls. Both Germanies before unification had PLZ (Postleitzahlen, sort of like ZIP codes) which were four numeric characters and the post used the characters from left to right for sorting purposes (i.e. 7xxx was southwest Germany in the BRD). After unification quite a few cities had the same code assigned, so for a time you had to say "O-3214" or "W-3214". So suddenly you had to have support for a longer PLZ – one non-numeric. Since 1993 it's five numeric characters – including possible leading zeros: 01001 is (part of) Dresden.
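The ISBN remark above can be sketched in Python. This assumes the usual ISBN-10 weighting (10 down to 2 over the first nine digits), and the function names are made up for the example. Note that the identifier stays a string throughout; only individual digits are ever converted.

```python
def isbn10_check_character(first_nine):
    """Compute the ISBN-10 check character from the first nine digits (given as text)."""
    if len(first_nine) != 9 or not first_nine.isdigit():
        raise ValueError("expected nine digits: %r" % first_nine)
    # Weights run from 10 down to 2 across the nine digits.
    total = sum(weight * int(ch) for weight, ch in zip(range(10, 1, -1), first_nine))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def isbn10_is_valid(isbn):
    """Validate a ten-character ISBN-10 without ever treating it as one big number."""
    return len(isbn) == 10 and isbn10_check_character(isbn[:9]) == isbn[9].upper()
```

An ISBN such as 080442957X has no sensible integer form at all, which is the point being made: the check character is a character.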
@Mr. Chen: Your question adds a "that" clause that I did not intend. There are strongly-typed languages with which you'd never have this problem, but not through incompatibility between numeric types. (Indeed, they are saved in memory in a rather standard binary format.) Can I name them or is there a "name no names" rule on this too?
@ Nicholas: Having a data format be numbers lends to numeric sorting. You are comparing apples with oranges. USPS zip codes: 0-4 are East of the Mississippi. Now compare that to French postal codes. Does such an ordering exist? Apparently not.
When using the %DATE% env var you'll run into problems if the regional settings are changed. For example, I like to set the date to "yyyy-MM-dd" so this is what I get:
C:\>echo %DATE%
2014-01-17
Here's a better way to get the date in a bat file:
:: Use WMIC to retrieve date and time
:: (the !VAR! references below require delayed expansion)
setlocal EnableDelayedExpansion
FOR /F "skip=1 tokens=1-6" %%A IN ('WMIC Path Win32_LocalTime Get Day^,Hour^,Minute^,Month^,Second^,Year /Format:table') DO (
IF NOT "%%~F"=="" (
SET /A SortDate = 10000 * %%F + 100 * %%D + %%A
set YEAR=!SortDate:~0,4!
set MON=!SortDate:~4,2!
set DAY=!SortDate:~6,2!
REM Add 1000000 so as to force a prepended 0 if hours less than 10
SET /A SortTime = 1000000 + 10000 * %%B + 100 * %%C + %%E
set HOUR=!SortTime:~1,2!
set MIN=!SortTime:~3,2!
set SEC=!SortTime:~5,2!
)
)
Of course this is missing the point about radix notation…
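For comparison, a language with a real date type sidesteps both the locale problem and the radix problem, because the month arrives as an integer rather than as sliced text. A minimal Python sketch, with an illustrative function name:

```python
import datetime


def half_of_year(day):
    """Use the integer month from a date object; no text slicing, no leading zeros."""
    return "First half" if day.month <= 6 else "Second Half"


august = half_of_year(datetime.date(2014, 8, 17))
today = half_of_year(datetime.date.today())
```

Here `august` is "Second Half", which is the answer the original batch file was supposed to give.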
"Go ahead and name a language where decimal constants are incompatible with octal or hex constants."
@Mr. Chen: For the second time, I did not say such a thing. What I said is "this can be avoided in a strongly-typed language." In Delphi, you'd not have this problem because it either interprets one parameter as hexadecimal or decimal. It is one of the following:
* "10" = 10, "010" = 10, "$10" = (Exception)
* "10" = 16, "A" = 10, "$10" = 16, "$A" = 10, "$010" = 16, "$0A" = 10
int32_decimal_t x = 1; int32_hex_t y = 0x10; x = (int32_decimal_t)y;"? (Would int32_decimal_t mean that it is a 32-digit BCD value?) What's the point of suggesting something that nobody would use? -Raymond]
@Mr. Chen: en.wikipedia.org/…/Straw_man
Maybe you should review my past messages and read what I actually said; and while you are doing it, put everything you know about Microsoft-style C++ programming aside. There is a world outside Microsoft's scope in which things are done differently.
@Fleet Command. Sorry, but you do not make any sense. You speak of strongly TYPED languages. For this to matter hex/octal/decimal constants must have different TYPES. You speak of Delphi. In Delphi it doesn't matter if you write $10 or 16, it's the same value and it has the same type. You just don't have the same problem in Delphi because it doesn't use the octal notation, but it has nothing to do with it being a strongly typed language or not.
@Marcel: Is that the source of the confusion? Wow!
Let me clarify. Batch file is not a strongly-typed language. Strings and numbers don't have different explicit types. C++ and Delphi are both strongly-typed languages because you explicitly define types in them and conversion from string to number requires parsing. Hence, you can tell your parser to convert only to decimal and perform type-checking and criteria-checking. (Hell, I even added an example to that effect above.)
But how did you guys interpret "this can be avoided in a strongly-typed language" into "hex/octal/decimal constants must have different TYPES", I don't know. Did you guys even read the blog post? Seriously, you guys should read more of what is actually written and less of what is not written.
int x[] = { 001, 010, 100 };. So it is apparently not strongly-typed enough. Is there a language that is strongly-typed enough to detect this error? -Raymond]
[Is there a strongly-typed language that treats decimal, octal, and hex constants as incompatible types?]
I could swear I've seen such a language, and it looked like a pain to work with. As much as a mere leading 0 for octal is strangely dangerous these days, strong typing to clear this up is not sane.
@Jon: MS are still using the AMD64 as an official name of x86_64, haven't stopped using it.
@Fleet Command: The problem is that even in strongly typed languages like C++ or C#, 10 != 010 – adding that leading zero changes the *value* of the constant, but it doesn't change the *type*. Unless you made Octal and Hex constants have a different *type* then you can't avoid this in languages which choose to use 0 as the prefix for Octal constants. Now the problem could've been avoided if C or any of the C-like languages which have followed had done something sensible and used, e.g. %10 to mean Octal 10 but, alas, none of them have done that.
@AndyCadley: Well said. But "010" is a string. You can convert it to 10 if you wanted to. (Just call the correct conversion function.) No such luxury with batch files, right?
@Fleet Command: So what you actually propose is to instead of
i = 011;
i = 11;
i = 0x11;
use
i = fleetcommand::tonumber<int>("11", 8); // i == 9
i = fleetcommand::tonumber<int>("11", 10); // i == 11
i = fleetcommand::tonumber<int>("11", 16); // i == 17
?
@Joker_vD: I'm pretty sure he's actually proposing:
i = fleetcommand::numberfrombinary("11")
i = fleetcommand::numberfromoctal("11")
i = fleetcommand::numberfromdecimal("11")
i = fleetcommand::numberfromhexidecimal("11")
and probably no others exist.
Since there are already such functions in existence, I am not proposing anything. Thank God, they are not part of a class called "fleetcommand"! I won't use a language that has a "fleetcommand" class! (:LOL:)
StrToInt function:
msdn.microsoft.com/…/bb773446%28v=vs.85%29.aspx
@Fleet Command: I actually thought about a namespace. Okay, let me write class "fleetcommand" in Delphi so you can stop using it, will you?