Saturday, April 23, 2016

PHP-FIG, Composer, and other disasters from the last decade of PHP

Let's talk about PHP. The scripting language, not the health insurance. PHP is, in my opinion, one of the greatest development tools ever created. It didn't start out that way, which is where most of its bad rap comes from, but it has transformed over the past decade into something worth using for any size project (and people do!). More specifically, I've personally found PHP to be an excellent prototyping and command-line scripting tool. I don't generally have to fire up Visual Studio to do complex things because I have access to a powerful cross-platform capable toolset at my fingertips. It's the perfect language for prototyping useful ideas without being forced into a box.

BUT! Some people WANT to force everyone into a box. Their box. Introducing the PHP-Framework Interop Group or PHP-FIG. A very professional sounding group of people. They are best known as the folks who produce documents called PHP Standard Recommendations aka PSRs. This group of 20 or so people from a wide-range of very popular projects have gotten together to try to work out some of the problems they have encountered when working with PHP. Their goal is simple:
"The idea behind the group is for project representatives to talk about the commonalities between our projects and find ways we can work together. Our main audience is each other, but we’re very aware that the rest of the PHP community is watching. If other folks want to adopt what we’re doing they are welcome to do so, but that is not the aim. Nobody in the group wants to tell you, as a programmer, how to build your application."
No, "We'll just let everyone else tell you how to build your application." At least that's the implication and it certainly is what seems to be happening.

There's nothing wrong with having Standards. In fact, I'm a strong advocate of them. What I'm NOT an advocate of is being told that my code has to be written a specific way by clueless people who blindly follow PHP-FIG PSRs without understanding where they are coming from. The worst offender is pretty everyone in the Composer camp. In software development, the more dependencies you have, the more likely it is that your project will break in spectacular ways. And, as we all know, everything breaks at the most inopportune times. Composer takes that concept to its extreme conclusion and introduces the maximum amount of dependencies into your software project all at once. No thank you very much. Correct software development attempts to reduce dependencies to the bare minimum to avoid costly breakages.

Composer exists because PSRs and lazy programmers who don't know how to develop software exist. PSRs exist because PHP-FIG exists.

The worst PSR in PHP-FIG is PSR-4, formerly PSR-0: The autoloader. As hinted by the zero (0) in "PSR-0", it was the first accepted "Standard" by PHP-FIG - and I use the word Standard loosely here. The concept of the autoloader stems from a very broken portion of PHP known as a namespace. In most normal programming languages that implement namespaces, the idea is to isolate a set of classes or functions so they won't conflict with other classes and functions that share the same name. Then the application developer can choose to 'import' (or 'use') the namespace into their project and the code compiler takes care of the rest at compile-time - all the classes and functions of the whole namespace become immediately available to the code.

That sounds great! So what could possibly go wrong?

In PHP, however, namespaces were only halfway implemented. PHP developers have to declare, up front, each class they want to 'use' from a namespace to simplify later code AND manually load each file that contains the appropriate class. This, of course, created a problem - how to get the files to load that contain the code for the class without writing a zillion 'require_once' lines? Instead of correctly implementing namespaces and coming up with a sane solution, a hack was developed known as __autoload() and later became a formalized hack known as spl_autoload_register(). I call it a hack because the autoloader is effectively an exception handler for a traditional code compiler - something no one in their right mind would ever write. With an autoloader, at the very last moment before PHP would throw up an error about a missing class, the autoloader catches the exception and tells PHP, "Oh never mind about that, I got it." Thinking about all of the backend plumbing required to make THAT nonsense happen (instead of correctly implementing namespaces in PHP) makes my head hurt.

Exception handlers, when written correctly, do nothing except report the exception upstream and then bail out as fast as possible from the application. Exceptions happen when an unrecoverable error condition occurs. Good developers don't try to recover from an exception because they realize they are in a fatal, unrecoverable position. (This is why Java is fundamentally broken as a language and a certain company that shall not be named made many terrible decisions to ultimately select Java as their language of choice for a certain popular platform that shall also not be named.)

Instead of fixing the actual problem (i.e. broken namespace support), us PHP userland developers get the autoloader (i.e. a hack). Composer and its ilk then builds upon the broken autoloader concept to create a much larger, long-term disaster: Shattered libraries that have dependencies on project management tools that someone may or may not want to use (Hint: I don't) and dependencies on broken implementations of certain programming concepts that should be fixed (i.e. PHP namespaces. By the way, don't use things that are broken until they have been fixed - otherwise you end up with hacks).

Another problem lies in the zillions of little files that PHP-FIG PSRs have directly resulted in (e.g. insane rules like "every class MUST be in its own file"), which results in huge increases in upload times to servers over protocols like SFTP (and FTP). What is known as a "standalone build" is pretty rare to see these days. A standalone build takes all of the relevant files in a project and merges them into one file. A good standalone build tool also allows users to customize what they receive so the file doesn't end up having more bloat than what they actually need.

Congratulations PHP-FIG: You've successfully exchanged one problem (i.e. poorly written classes) for a different problem (i.e. poorly written classes spanned across hundreds of files with massive, unnecessary dependencies that take forever to upload and rely on broken-by-design non-features of PHP).

Friday, April 22, 2016

Need a random TCP port number for your Internet server application?

When writing a TCP server, the most difficult task at the beginning of the process is deciding what port number to use. The Transmission Control Protocol has a range of port numbers from 0 to 65535. The range of an unsigned short integer (2 bytes). In today's terms, that is a fairly small space of numbers and it is quite crowded out there. Fortunately, there are some rules you can follow:

  • Specifying port 0 will result in a random port being assigned by the OS. This is ideal only if you have some sort of auto-discovery mechanism for finding the port your users are interested in (e.g. connecting to a web server on port 80 and requesting the correct port number). Otherwise, you'll have to occupy an "open" port number.
  • The first 1023 port numbers are reserved by some operating systems (e.g. Linux). Under those OSes, special permissions are required to run services on port numbers under 1024. Specifically, the process either has to have been started by the 'root' user OR been declared okay by the 'root' user with special kernel commands (e.g. setcap 'cap_net_bind_service=+ep' /path/to/program).
  • Of the remaining port numbers, huge swaths of the space have been used for what are known as Ephemeral Ports. These are randomly selected and used for temporary connections. 1024-5000 have been used by a number of OSes. IANA officially recommends 49152-65535 for the Ephemeral Ports.
  • Port 8080 is the most common "high" port that people use (i.e. alternate web server port). Avoiding that is a good idea.
Having all of that knowledge is useful but what you need is a random number generator that can generate that port number for you. Without further ado, your new port number awaits your clickity click or tappity tap:

Get a random TCP port number

As long as it doesn't output 8080, you are good. If it does output 8080 and/or stomping on IANA assignments bothers you, reload the page and try again.

Friday, January 01, 2016

2015 Annual Task List Statistics

At the end of last year, I decided to start collecting some statistics about my ever-changing software development task list. To do that, I wrote a script that ran once per day and recorded some interesting information from my task list manager (a flat text file) and the number of open tabs in Firefox. What follows are some interactive (oooooh, shiny!) charts and some analysis:

The number of tasks on my task list peaked twice this year at 78 tasks and dropped one time to 54 tasks. The number of tasks appears to be decreasing according to the trend line in the first chart. However, the second chart tells a slightly different story. Even though the number of tasks is on the decrease, the file size of the text file in which the tasks are stored is apparently on the increase. This tells me that the overall complexity of each individual task is slightly higher or I'm just slightly better at documenting details so I don't forget what the task entails (or some combination of both).

The final chart is probably the most interesting and perhaps the most telling. It shows how many open tabs I have in Firefox. Firefox is my primary web browser in which I do all of my research for my software development. I tend to close a bunch of tabs around the time I release a new version of my software. As a result, I figured it would be a good measure of my development habits. During the early part of the year, I got the number of open tabs down to 70. And that's the lowest it went. At the early part of December, however, the number of open browser tabs dramatically spiked to 244 and dropped to 90 open tabs just a couple of weeks later. If you look at my forum activity around the time the tabs dropped, you see a correlation to when I teased a new piece of software and how much I really dislike certain aspects of the Windows API. About 150 browser tabs were open for various bits of information to help construct a brand new piece of software. The overall trend line for browser tabs is, of course, on the increase. I have a feeling, based on the official 2016 CubicleSoft project list, that the increasing line will remain the trend.

There are other drops in task and tab counts that correlate to various software releases. I'm personally most interested in overall trends. I really want to see the number of tasks diminish over time. I'd like to see complexity on the decrease too. It would be awesome to completely wipe out all of my browser tabs. None of those are particularly realistic, but I can dream. There are projects I need to do to get to a point where I feel like software has stabilized.

(By the way, I'm aware the spreadsheet that the data is in is public. There's not much else in the data beyond what is seen in the charts but maybe someone will come up with some additional and interesting anecdotes.)

Saturday, November 21, 2015

Why developers should do their own documentation and code samples

I was recently on the Microsoft Developer Network website (aka MSDN) looking at some API documentation. Many of the more popular APIs have code examples so the developer can see example usage rather than have to try to understand every nuance of the API before using it. The particular API that I was looking to use had an example, so I made the unfortunate decision to look at the code. The example was a turd. It wasn't a polished turd. It was just a normal, run-of-the-mill turd. The code had HANDLE leaks, memory leaks, and a bunch of other critical issues. It looked like it was written by a 20 line Norris Number programmer (aka newbie).

Being rather bothered by this, I set out to learn how Microsoft produces its code samples. According to one source I found, the company hands the task off to interns. So, sample code that a whole bunch of other programmers are going to simply copy-pasta into their own code is being written by amateur programmers. Nothing could possibly go wrong with that. If the examples are indeed written by interns, it certainly explains why the quality of the code samples in the documentation is all over the map ranging from really bad to barely passable. It's certainly not what I would expect from a professional organization with 50,000 employees. If you open a HANDLE, close it. Allocate memory? Free it. Simple things that aren't hard to do but help achieve a level of professionalism because you know that other people are just going to copy the example into their code, expect it to work, and not have unforeseen bugs in production.

MSDN is the face of Microsoft most people don't really get to see unless they start developing for the Windows OS. But it matters who produces the documentation because a single mistake is going to affect (tens/hundreds of) thousands of applications and millions (billions?) of people. API documentation is almost always too intricate for most other developers to fully understand. While it is the be-all-end-all definitive overview of any give API call, code examples provide context and meaning. A lot of people struggle with "so if I use this API, what do I do next" but have the "aha!" moment when they see a working example connecting the API to other code. Developers will copy and paste an example long before they fully comprehend any given API. For this reason, code examples need to have the same care and professionalism applied to them as the API itself. Passing this responsibility off to an intern is going to create significant long-term problems.

Writing your own code examples for an API also has the benefit of revealing bugs in the API. If the developer who made the API is writing the documentation for it and the code sample, they are 15 zillion times more likely to spot mistakes and correct them before they get released into the wild. Pass that responsibility off to an intern? Well, the intern is going to not run into or just ignore the bugs in the API because THEY DON'T CARE. They want the paycheck and the checkmark on their graduation forms that says they did their internship. Users (developers) have to live with the disaster that interns leave behind in their wake. Putting them on documentation and code example writing tasks means interns will be the face of the company that developers (i.e. the people who matter the most) will see. That strikes me as unprofessional.

In short: Develop an API? Do your own documentation and code sample writing. Is it tedious and boring? Yes. But it is important to do it anyway. In fact, it is infinitely more important than the API you wrote.

Thursday, November 12, 2015

Let's NOT Encrypt - Critical problems with the new Mozilla-sponsored CA

Starting a new Certificate Authority is a time-consuming, expensive, and difficult task. It is also annoying to set up and maintain SSL/TLS certificates. So I completely understand what Let's Encrypt is trying to do. Their goal? Free, functional SSL/TLS certificates that are easy to create, install/deploy, and even keep up-to-date. What's not to like about that? Well, it turns out there are some serious problems with this up-and-coming Certificate Authority (CA). I'm going to list the issues in order of concern:

  1. Doesn't solve the problems of storing roots in the browser or global trust issues.
  2. A U.S.-based company.
  3. Browser support/acceptance.
  4. Sponsored by Mozilla.
  5. Other, publicly traded, corporate sponsors.
  6. A brand-new, relatively untested, and complex issuance protocol (ACME).
  7. Limited clients (Python bindings only) and no libraries.
  8. Linux only.
Each of these issues in detail:

For the first issue, even though it is all we have got, SSL/TLS is fundamentally broken. Let's Encrypt builds upon broken technology and is therefore also fundamentally broken. Instead of fixing the core problem, it merely obscures it. We need to scrap the current mess and start over, using the understanding of what we have learned over the years, not bury broken technology with more broken technology - see the spam in your in-box to learn how well that's worked out for you. Distributed authorities and/or trusted peering, sensible user-presentations (instead of today's scary-looking warning dialog boxes), NOT distributing default roots (we shouldn't even have root certificate stores - it should be root-per-domain), and web of trust are better steps in the right direction and lets people do things with certificates currently not possible (e.g. issuing their own signed cert chains without raising warnings), and possibly redesigning portions of TLS from the ground-up. Ultimately, each individual and company should be able to be their own CA free and clear on the Internet for true Internet security.

For the second issue, Let's Encrypt is a U.S.-based company. They proudly display that information when they say they are a non-profit 501(c)(3) organization. This is a HUGE problem because being a U.S.-based company makes that company susceptible to secret FISA rulings. As a result, a FISA court could order them to turn over their root certificates AND not say a word to the public with severe penalties if they violate the ruling. FISA courts are in cahoots with the NSA, CIA, and FBI and rarely rule in favor of companies or citizens. Until this relationship is resolved amicably (e.g. dissolve/neuter FISA and reset all root certs), it is extremely dangerous to have a Root Certificate Authority operate within U.S. borders.

For the third issue, Let's Encrypt has a huge uphill battle to get added to the root certificate store of every major browser and OS. StartCom, an Israeli-based company which also offers free domain validated certificates today via StartSSL, took years to get through the process to be added to browser and OS root certificate stores, and then even longer to get enough market share to be deemed viable for use. Let's Encrypt has to go through the same process that StartCom did, which means they are about 5 years away from viability. The only positive side to Let's Encrypt is they plan to offer free certificate revocation, whereas StartCom does not. Again, all of this process is required because, as the first issue pointed out, SSL/TLS is broken technology. Instead of fixing SSL/TLS, they opted to adopt it.

For the fourth issue, Mozilla appears to be the primary sponsor. Mozilla makes Firefox and they now basically own Let's Encrypt. It smacks of collusion and that can be quite dangerous. It certainly will be extremely suspicious if Mozilla is the first to adopt the Let's Encrypt root into the root certificate store of Firefox. Browser/OS vendors seem to wait until someone else includes the root first, so this is highly advantageous for Mozilla because they can artificially accelerate the process. If they pull such a stunt, it could result in a lawsuit from other CAs who had to go through the extended process and/or extremely ironic antitrust litigation against Let's Encrypt and Mozilla by the Department of Justice. I say ironic because Mozilla used to be Netscape, who was the source of antitrust litigation against Microsoft when they bundled Internet Explorer with Windows back in the day. Mozilla getting slapped with antitrust litigation would be the most entertaining thing for us tech watchers that could happen - if that happens, grab your popcorn and sit back and enjoy the show!

For the fifth issue, while I understand that a public Root Certificate Authority is expensive to start (estimated initial costs are at least $50,000 USD) and that corporate sponsors have that kind of money, it is rather inappropriate. There needs to be complete, full transparency with regards to the money here. It is extremely important during the setup phase of a CA like this. As far as I can tell, the project is distinctly missing that information. Also their financials aren't readily available online on their website despite being a non-profit organization that claims to increase web friendliness. According to Charity Navigator, they have collected about $100,400 to date, which is on par for starting up a CA.

For the sixth issue, the ACME protocol is a draft specification that I assume will eventually be sent to the IETF. However, it forms the basis of Let's Encrypt. It's a beta protocol and subject to change. As a software developer, I also feel like it is overly and unnecessarily complex as most IETF documents are wont to be. There are a number of issues with the ACME protocol that I feel are vague and therefore open to interpretation. As a counter-example, JSON-Base64 is NOT open to interpretation - it is an extremely clear file format and defers entirely to the TWO nearly identical, official public domain implementations of the library if there is any doubt as to how an implementation MUST implement JSON-Base64. As a result, there is no doubt about how JSON-Base64 works. This, of course, leads me to the next issue...

For the seventh issue, additional clients in other programming languages and libraries to talk ACME may come. Eventually. I have a serious problem with writing a spec before writing an implementation: Real implementations reveal flaws in the spec and updating the spec after it is written is always a low priority. Whereas writing the spec afterwards results in a clean, clear document that can defer to the implementations. Always write general guidelines for the implementation, THEN develop a couple of nearly identical implementations in a couple of different languages, hammer out the bugs, and FINALLY write the final specification based on the implementations BUT defer to the implementations. As usual with IETF related cruft that gets dumped into the wild, the reverse has been done here and this annoying habit results in inevitable problems later on. Again, see the spam in your in-box - you can thank the IETF for that. Tightly-controlled implementations first, specification second.

For the last issue, I give a great, big sigh with gentle facepalm. The authors claim a Windows Powershell solution is coming but that ignores, well, pretty much everything rational. Are they going to support Portable Apache + PHP + Maria DB too? People who develop first for Linux almost always leave cross-platform development as an afterthought and up to other people to resolve because they are too lazy to do the right thing. It's a shameful practice and there should be great amounts of public humiliation heaped on anyone who does it. Windows still dominates the desktop market share, which is where local corporate development boxes live. To choose to ignore the platform users actually work on is just plain stupid. The more critical issue is that supporting only web server software is going to result in headaches when people want it to work for EVERY piece of SSL-enabled software (e-mail servers, chat servers, etc) and supporting just a few products has opened a can of worms they can't close. The Let's Encrypt developers will forever be running around getting nothing of value done.

At the end of the day, Let's Encrypt solves nothing and creates a lot of unnecessary additional problems. It's also a long way off from being viable and there are plenty of legal landmines they have to navigate with extreme care. In short, I'd much rather see a complete replacement for the disaster that is SSL/TLS. Also, people need to stop getting so excited about Let's Encrypt, which simply builds upon fundamentally broken technology.

Saturday, September 12, 2015

GitHub commits publicly reveal your private life

GitHub is a great tool. It enables software developers to work together on open source projects. That's pretty awesome. However, it also unfortunately exposes your personal life to the entire world. It is easy to look at the history log of commits for any given GitHub user and identify their an incredibly creepy level.

Using GitHub histories, an attacker can identify when you are probably awake, asleep, at home, and at work. They can also identify habits such as what days of the week you tend to commit code. As well as what days of the week you never commit code. Which days and months you commit the most code and which days and months you do not as well as the frequency of commits. All of that information can be used to derive your physical location in the world, your religion, your favorite sports team(s), and your relationship status with your significant other (if you are on good terms or not, having sex or not, etc). And possibly your hobbies and general interests.

If you don't think your personal data isn't already being mined for the above, you are quite mistaken. It is.

In my opinion, commit timestamps are a security vulnerability. Let's say an attacker wants to "send a message" to a software developer they don't like. They simply figure out when the person is going to be away from their home, show up, do their thing (tag/graffiti, rob/steal/destroy property, drop a threatening letter, etc), and leave. GitHub commit timestamps provide a wealth of information and, according to the field of statistics, an attacker only needs 35 data points to achieve what is known as "statistical significance". Each commit timestamp is a valuable data point. Therefore, all someone needs is 35 commits to start building a profile. So the attacker may notice that the commit history is devoid of commits during the week Monday through Friday during "normal" business hours when mapped to a specific timezone (i.e. narrowed to a specific region of the world). They can reasonably assume that the target is at some form of a day job. Physical addresses are pretty easy to obtain when the real name is acquired, so the commit history just confirms what is already published information. The lack of commits is information that is just as important as the actual commits.

More data points simply improves the accuracy of the information. Thus, the more frequently you commit, the more information about your personal life that you give away! At some point, with enough commits, everything about your personal life can be determined. Oh, you don't commit code whenever a specific sports team plays a game and it airs on television? You might be a fan of that team because humans are literally incapable of doing two things at once - despite the fact that some people that claim the contrary humans are single-taskers. When you are committing code, you are in front of a computer screen and focused on that singular task. Your favorite TV shows are also possibly able to be determined with GitHub commits simply because you aren't committing code during the time you watch those shows - although if you use Netflix, Hulu, etc., then that information can be a lot harder to determine but you generally won't be committing code while watching any given show.

So how should this be fixed? First off, the entire world doesn't need to see commit timestamps. Timestamps should only be accessible to trusted users and services. I realize timestamps are part of the commit log, so Git itself will have to be changed to accommodate fixing this issue. Second, there should be tight controls over how much timestamp information is disseminated even to trusted users and services. And finally there should be a timestamp privacy (mangling) option to set commit timestamps to specific/random times according to a ruleset that the committer makes (e.g. hardcode all commits to Mondays at random times of the day regardless of the fact the code was committed on Thursday at some specific time of the day).

Wednesday, July 22, 2015

Solving "unresolved external symbol ___report_rangecheckfailure" Visual Studio linker errors

Let's say you import a library from Visual Studio 2012 or later into your project in an older version of Visual Studio (e.g. Visual Studio 2008 or Visual Studio 2010) but now get linker errors:

error LNK2019: unresolved external symbol ___report_rangecheckfailure referenced in function ...
error LNK2001: unresolved external symbol ___report_rangecheckfailure ...

Sad day. Especially since you don't really get a say in how that library is being built. Your options are:

  1. Upgrade your version of Visual Studio. That includes going through the whole project upgrade cycle. We know how well that usually goes.
  2. Recompile the library yourself. Sad day turns into sad week.
  3. Hack it.
The function __report_rangecheckfailure() is called when the /GS compiler option is used. The option enabled buffer overflow security cookie checking, which, in this day and age, is a good option to have enabled. Unfortunately, that causes problems with older versions of Visual Studio. Let's take a look at what the function does - the source code from 'VC\crt\src\gs_report.c' has this code:

// Declare stub for rangecheckfailure, since these occur often enough that the code bloat
// of setting up the parameters hurts performance
__declspec(noreturn) void __cdecl __report_rangecheckfailure(void)
Hmm...not really helpful since it calls another function. However, that function contains this very interesting comment after a lot of inline assembler and macros:

     * Raise the security failure by passing it to the unhandled exception
     * filter and then terminate the process.
So, knowing this normally triggers an unhandled exception and exits the process, we can hack it:

__declspec(noreturn) void __cdecl __report_rangecheckfailure(void)
I'm not sure whether to congratulate myself on this evil hack or cry. I think I'll do a little of both. You're welcome.

Oh, and if you work on the Microsoft Visual Studio development team, please develop a compatibility library that implements stuff correctly for older Visual Studio environments. Doesn't have to go back to 1995 VS6, but something reasonable like a 10 year window that addresses issues like these.