Thursday, December 22, 2016

Virtual Private Servers (VPS) and Cloud hosting are now viable

For many, many years, I was a massive fan of dedicated web hosting. I was VERY vocal about how you couldn't run a legitimate, professional business without using dedicated web hosting. And time and time again, I was proven right as people on shared web hosting came out of the woodwork in various places who had bet their business on shared hosting and lost - and sometimes they lost EVERYTHING including their business and all their customers!

Shared web hosting is still the bottom of the barrel, scummy/scammy money grab that it has always been and no respectable business should be caught dead running their web infrastructure on it. Period. That hasn't changed.

However, I have been watching a couple of new stars grow from infancy and come into their own over the past 8 years: Virtual Private Servers, aka VPS, and their newer, shinier cousin, Cloud hosting.

Dedicated web hosting is expensive. It always has been because you get a piece of hardware, a network drop, electricity, a transfer limit, an SLA (e.g. 99.9% uptime guaranteed), and a contract. On shared hosting, you can't do whatever you want and will be given the boot if you try to do much of anything with it. However, in the case of dedicated hosting, the CPU, RAM, and hard drive are yours to do with as you please. Entry level dedicated servers start around $60/month for a 1 to 2 year contract period and rapidly go up to hundreds of dollars per month for beefier hardware. But you can run as many websites as you are comfortable with on that one server. If you are looking for dedicated hosting, I still recommend 1&1 Dedicated Hosting.

Virtual Private Servers (VPS) have generally been a cheaper option, but with the serious caveat that you can't do much with them. It's a blend between shared hosting and dedicated hosting. You get a mostly isolated OS instance (inside a virtual machine) but you share the same physical hardware with other virtual machines. Most VPS providers charge about half the cost of a dedicated server with a fraction of the CPU and RAM. The OS that runs on a VPS shares I/O resources with other virtual machines on the same host. CPU cores, RAM, and network bandwidth are isolated these days but I/O requests to physical media (i.e. hard drives/SSDs) are not. This means that one virtual machine can still potentially starve other machines on the same host. It's a problem still being worked on. Many VPS providers have also moved to using SSDs instead of hard drives, which helps with reducing I/O overhead time. For consumers, most VPS providers simply aren't cost effective - the amount of hardware (CPU, RAM, storage), or lack thereof, is usually the bottleneck - and so most web sites won't run on most VPS infrastructure very well. Some early adopter VPS providers started offering expandable VPS solutions after a while, which are the precursors to Cloud hosting options.

Cloud hosting is the newer shinier kid on the block. When it first came out, it was generally more expensive than dedicated servers. I kept an eye on it but pretty much wrote it off as a toy that would take years to mature. The idea is simple - separate hardware interests so that data can be floated around and attached and detached at will and massively replicate data around the globe so that it is readily accessible at the closest point to the user. In addition, additional hardware resources can be attached and detached at will and/or migrated as usage rises and falls. Implementation of that is difficult to achieve and the tools to build and maintain that sort of infrastructure at first didn't really exist. The tools eventually were developed and have matured over many years (as I predicted) and Cloud hosting has also subsequently matured. It is still more expensive to deploy than a VPS and can still be more expensive than dedicated hosting.

So, why write this post? As my dedicated hosting contract reached its end of life this year, I started looking around at my options. I was quite aware of Digital Ocean, which has an amazing programmable API for spinning up and down instances (Droplets) and doing all sorts of crazy things with virtual machines. After a lot of research and personal fiddling around with their API, I have more or less decided that Digital Ocean is only good for temporary, toy instances where you've got an idea you want to try out before deploying it for real. The general consensus I've seen in the larger community is that people shouldn't try running a real website on Digital Ocean or Amazon AWS. Amazon AWS, while similar, is also more expensive than Digital Ocean and both AWS' cost calculators and Amazon's horrible, convoluted Console, API, and SDK will drive you up the wall.

Then, after a lot more searching, I finally discovered OVH VPS and OVH Cloud Hosting. For ~$13.50/month, a fraction of the cost to get the same setup elsewhere in VPS land and beating some low-end dedicated hosting hardware-wise, OVH provides a fully functional 2 core VPS with 8GB RAM and enough monthly transfer for most businesses. Their offerings here, hands-down, absolutely crush Digital Ocean - the importance of having enough RAM overhead to actually do things can't be overemphasized! In addition, for half the cost of a low-end dedicated server, OVH's lowest end Cloud hosting option completely blows most low-end dedicated hosting out of the water in terms of hardware specs and scalability readiness. I honestly don't know how they are managing to do that and still turn a profit - based on some recent-ish server blades I've seen pop up, I have a few ideas but, even then, margins per blade are thin. The ONLY downsides to using OVH are that they don't have automatic billing/renewal capabilities for their VPS and Cloud hosting options and that I had to come up with an alternative to my previous firewall solution. The Canadian company has been around for so long that their payment system still uses CGI scripts to process payments (there's an early 2000's throwback for you). OVH should scrap their current payment system and use Stripe for a flexible PCI compliant payment solution, which also happens to be what Digital Ocean currently uses.

In short, I stopped using 1&1 at the end of my contract period and have been quietly using OVH for many months now. My costs are significantly reduced but the hardware isn't as robust as before (to be expected - I went from an 8 core dedicated to a 2 core shared system) and not having automatic billing is a tad irritating. OVH makes it easy to renew, but no one should have to do manual renewals of a standard service for a wide variety of reasons - the least of which being that everyone else in the industry offers automatic renewals and they are the weird ones here. 1&1 has a configurable Cisco firewall for their dedicated server products that worked quite well - one of the reasons I stuck with them for so long. So I also now deploy and use good iptables rules and the Web Knocker Firewall Service for a powerful firewall combo that is basically fire-and-forget and superior to most firewall setups.

Update March 2017: I recently discovered that OVH VPS servers have an IP level firewall available. Using it supposedly also helps keep a VPS from bouncing onto and off of their DDoS infrastructure, which has some issues of its own. I still believe Web Knocker is a better solution for a more refined firewall, but having dedicated upstream hardware is a nice addition.

You can still buy a brand new Dot Matrix printer...

Today, I learned that people still buy brand new dot matrix printers. You know, those extremely noisy printers I thought we ditched as soon as it was possible to do so. Well, except for the nutcases who turn them into "musical instruments" and start a YouTube channel:

But, no, sales of brand new(!) dot matrix printers are apparently still, relatively-speaking, alive and well:

Dot matrix printers on Newegg

After doing some research, it turns out that, for bulk printing where output quality and "professional" appearance don't matter at all, dot matrix printers can be anywhere from 4 to 8 times cheaper per printed page than laser printers (the next cheapest technology) once maintenance costs are amortized over the lifetime of each type of printer. With dot matrix, you're not going to get the speed, accuracy, or the quietness of laser, but you'll supposedly save a boatload of money on toner.

Maybe one day we will get a printer that combines the best of all printing technologies in one compact, affordable device: Dot matrix, laser, inkjet, 3D, and a bunch of other print heads. Ideally, it would be a single device that doesn't care about the type of material being printed on, including bulky and strange shapes, and automatically adapts to it. It also shouldn't fall apart after two months of use or cost an arm and a leg to maintain (inkjet printers - I'm looking at you). Basically, I'm asking for a Star Trek replicator.

I don't ask for much.

Tuesday, December 13, 2016

Bulk web scraping a website and then reselling the content can land you in legal hot water

This interesting article on web scraping just came to my attention:

New York Times: Auction Houses Face Off in Website Data Scraping Lawsuit

Summary: An auction house in New York is suing an auction house in Dallas for copyright law violations regarding scraping the New York auction house's website listings including their listing photos and then SELLING those listings and photos in an aggregate database for profit.

As I'm the author of one of the most powerful and flexible web scraping toolkits (the Ultimate Web Scraper Toolkit), I have to reiterate the messaging found on the main documentation page: Don't use the toolkit for illegal purposes! If you are going to bulk scrape someone's website, you need to make sure you are free and clear legally for doing so and that you respect reasonable rate limits and the like. Reselling the data acquired with a scraping toolkit seems like an extremely questionable thing to do from a legal perspective.

The problem with bulk web scraping is that it costs virtually nothing on the client end of things. Most pages with custom data these days are served dynamically to some extent. It takes not insignificant CPU time and RAM on the server side to build a response. If one bad actor is excessively scraping a website, it drags down the performance of the website for everyone else. If a website operator starts to rate limit requests, then they will run into additional technical issues (e.g. they'll end up blocking Googlebot and effectively delist themselves from Google search results).

The Ultimate Web Scraper Toolkit is capable of easily hiding requests in ways that mimic a real web browser or a bot like Googlebot. It follows redirects, passes in HTTP Referer headers, transparently and correctly handles HTTP cookies, and can even extract forms and form fields from a page. It even has an interactive mode that comes about as close to the real thing from the command-line that a person can get in a few thousand lines of code. I'm only aware of a few toolkits that can even come close to the capabilities of the Ultimate Web Scraper Toolkit, so I won't be surprised if it comes to light that what I've built has been used for illegal purposes despite the very clear warnings. With great power comes great responsibility and it appears that some people just aren't capable of handling that.

Personally, I limit my bulk scraping projects and have only gotten into trouble one time. I was web scraping a large amount of publicly available local government data for analysis and the remote host was running painfully slowly. I was putting a 1 second delay between each request, which seemed reasonable but apparently my little scraper project was causing serious CPU issues on their end and they contacted me about it. The correct fix on their end was probably to apply a database index but I wasn't going to argue for that as I had already retrieved enough data at that point anyway for what I needed to do and so I terminated the scraper to avoid upsetting them. Even though I was well within my legal rights to keep the scraper up and running, maintaining a healthy relationship with people who I might need to work with in the future is important to me.
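
The fixed delay between requests described above can be sketched in a few lines of shell. This is only an illustration, assuming curl is installed; FETCH, DELAY_SECONDS, and polite_fetch are names made up for this example and are not part of the Ultimate Web Scraper Toolkit:

```shell
# Fetch each URL in turn, pausing between requests.
# FETCH and DELAY_SECONDS are illustrative knobs (assumptions, not toolkit options).
FETCH="${FETCH:-curl -fsS}"
DELAY_SECONDS="${DELAY_SECONDS:-1}"

polite_fetch() {
    first=1
    for url in "$@"; do
        # Sleep before every request except the first one.
        if [ "$first" -eq 0 ]; then
            sleep "$DELAY_SECONDS"
        fi
        first=0
        # $FETCH is intentionally unquoted so "curl -fsS" splits into words.
        $FETCH "$url" || echo "failed: $url" >&2
    done
}

# Example (placeholder URLs):
#   polite_fetch "https://example.com/page1" "https://example.com/page2"
```

As my own experience shows, a fixed delay is a floor rather than a guarantee. If the remote host is visibly struggling, raise DELAY_SECONDS or stop.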

Here's an interesting twist: Googlebot is the worst offender when it comes to web scraping. I've seen Googlebot scrape little rinky-dink websites at upwards of 30,000 requests per hour! (And that's only counting requests from Googlebot's officially published IP address range in the server logs.) I'd hate to see Googlebot stats for larger websites. If you don't allow Googlebot to scrape your website, then you don't get indexed by Google. If you rate limit Googlebot, you get dropped off the first page in listing results on Google. Ask any website systems administrator and they will tell you the same thing about Googlebot being a flagrant abuser of Internet infrastructure. But no one is suing Google over the extreme abuse of our infrastructure because everyone wants/needs to be listed on Google so that various business operations function. Yet at least one auction house is happy to sue another auction house while, at the same time, they both are happy to let Googlebot run unhindered over the same infrastructure.

I believe FCC Net Neutrality rules could be used as a defense play here to limit damages awarded. IMO, the Dallas auction house, if they did indeed do what was claimed, violated copyright law - reselling scraped photos is the worst part. Then again, so did Google and you simply can't play favorites under the FCC Net Neutrality rules. Those same listings are certainly web scraped and indexed by Google because the auction house wants people to find them, which means Google has similarly violated copyright law to be able to index the site. Google then resells those search results in the form of the AdWords platform and turns a profit at the expense of the New York auction house. No one's batting an eyelash at that legally questionable conflict of interest and the path of logic to this point is arguably the basis of a monopoly claim against Google/Alphabet. IMO, under FCC Net Neutrality rules, unless the New York auction house (Christie's) similarly sues Google/Alphabet, they shouldn't be allowed to claim full copyright damages from the Dallas auction house.

At any rate, all of this goes to show that we need to be careful as to what we scrape and we definitely don't sell/resell what we scrape. Stay cool (and legal) folks!

Friday, December 09, 2016

Setting up your own Root Certificate Authority - the right way!

Setting up your own Root Certificate Authority, aka Root CA, can be a difficult process. Web browsers and e-mail clients won't recognize your CA out-of-the-box, so most people opt to use public CA infrastructure. When security matters, using a public CA is the wrong solution. Privately owned and controlled CAs can be infinitely more secure than their public counterparts. However, most people who set up a private CA don't set up their CA infrastructure correctly. Here is what most private CAs look like:

Root CA cert -> Server cert

This is wrong because the server certificate has to be regenerated regularly (e.g. annually). If the root certificate is compromised, then it involves fairly significant effort to replace all of the certificates, including the root. What should be built is this:

Root CA cert -> Intermediate cert -> Server cert

In fact, this is the format that most public CAs use. The root CA cert is generated on a machine that isn't connected to any network. Then it is used to generate any necessary intermediate certs. Both the root and intermediates are signed for long periods of time - typically about 10-30 years for the root and 2-5 years for the intermediates. The root CA certificate private key is then physically secured - a physical vault of some sort helps here. The root CA is never, ever used on a network-connected machine. It is always used offline and it is only ever used to generate intermediate certificates. Only when specific conditions are met is the private key ever accessed. Usually those conditions entail a paper trail of accountability with multiple people who are physically present and are authorized to access the key for purposes declared in advance.

After the root CA generates the intermediate certificates and is secured, the intermediate certificates are then used to generate other certificates. The intermediates can possibly sit on a network connected machine, but that machine is behind a very well-guarded firewall.

So how does one create this setup, while implementing it easily and securely, AND relatively cheaply? First, you are going to need a few things:

Wall-powered USB hub with at least 4 open ports (you can't use built-in USB ports on a computer for this since they are usually underpowered - blame your motherboard manufacturer)
Six USB thumbdrives (ideally, brand new - they can have tiny storage and therefore be cheap - if you are doing this for personal use only, you can scale it back to three thumbdrives)
VirtualBox or VMWare
An ISO of a public Linux distro of some sort - smaller is better (e.g. Tiny Core Linux)
1 CD-R + CD burner
Labels and a pen or marker
Multiple, physically secure locations for storing the various thumbdrives

If you are familiar with virtual machines and constructing a root CA the right way, perhaps you can see where this is going. Hopefully, you will at least agree with me that where I'm going with this is a step above what most people do for their root CA and it is FAR easier and cheaper to isolate/secure a USB thumbdrive than it is a whole computer system.

Now let's get this started:

Make sure everyone is in the room who needs to sign off on the construction of the root CA and they are well-watered/coffee'd and have been to the restroom recently. They aren't going anywhere for a while. If someone leaves during this time, there's a pretty good chance the entire process will have to be restarted depending on how serious you are about the root CA. It will help speed things up considerably if the individual doing the technical side has gone through a dry run on their computer in advance to get familiar with the process. Also decide in advance which of the four people present will receive thumbdrives containing important data required to decrypt the virtual machine and the root CA private key.

Start an audit log file on the USB thumbdrive where the virtual machine will reside. Don't forget to label this thumbdrive with something like "Virtual machine, audit log". In the audit log, document the date, time, and notes for each operation - including each and every executed command against the virtual machine. The start of the audit log should declare who is present for the CA signing process as the first note. This file should be updated accordingly any time the thumbdrive is accessed in the future (e.g. to generate a new intermediate cert every couple of years).
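
One way to keep the audit log consistent is a tiny helper that timestamps every entry. This is a sketch only; AUDIT_LOG and audit_note are hypothetical names, and the path should point at the labeled thumbdrive wherever it happens to be mounted on the host:

```shell
# Hypothetical location of the audit log; adjust to the thumbdrive's mount point.
AUDIT_LOG="${AUDIT_LOG:-./audit.log}"

# Append one line: UTC date and time, a separator, then the note text.
audit_note() {
    printf '%s | %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S UTC')" "$*" >> "$AUDIT_LOG"
}

# Example entries (placeholders):
#   audit_note "Present: (names of everyone in the room)"
#   audit_note "CMD: mkdir root_ca"
```

Logging each command just before running it against the virtual machine keeps the file in the same order as the work itself.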

The next step is to make the Linux distribution ISO read only to all users. Depending on the host OS, this step will vary. A simple solution to making the ISO file read only is to burn the ISO to a CD as an ordinary file (i.e. don't burn a bootable CD). If you plan to burn a CD, you might as well burn a copy of the verification information mentioned in the next step. This portion of the process can be done in advance but should be noted in the audit log that the ISO is confirmed to be read only.
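
On a Linux or macOS host, the file-permission route might look like the sketch below. make_read_only is a helper name invented for this example. Keep in mind that the file's owner (or root) can always chmod the file back, which is why burning to a CD is the stronger option:

```shell
# Remove write permission for everyone, then confirm no 'w' bit remains.
make_read_only() {
    chmod a-w "$1" || return 1
    # First 10 characters of 'ls -l' are the file type and permission bits.
    perms=$(ls -l "$1" | cut -c 1-10)
    case "$perms" in
        *w*) echo "still writable: $perms $1" >&2; return 1 ;;
        *)   echo "read only: $perms $1" ;;
    esac
}

# Example (placeholder filename):
#   make_read_only linux.iso
```

The line it prints is a convenient thing to paste into the audit log as confirmation.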

Next, verify the ISO against public verification information such as a hash or PGP signature. Then compare the same public verification information from the same source at a secondary location in such a way that defends against a man-in-the-middle attack. A different computer or device on a completely different network satisfies this (e.g. a smartphone visiting the same URL over a cell network). If that isn't possible and the same machine as the one that retrieved the ISO is used, opening a secure session to a temporary VPS (e.g. a Digital Ocean droplet), running 'wget' on the URL containing the information, and then using 'cat' on the retrieved information may be sufficient for all present.
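
For the hash case, the comparison can be done by the machine instead of by eye. A minimal sketch, assuming the distro publishes a SHA-256 hash and that GNU coreutils' sha256sum is available; verify_sha256 is a made-up helper name:

```shell
# Compare a file's SHA-256 digest against an expected hex string.
# Returns 0 on a match and 1 on a mismatch.
verify_sha256() {
    file="$1"
    # Published hashes are sometimes uppercase; normalize before comparing.
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    actual=$(sha256sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file matches the expected hash"
    else
        echo "MISMATCH: $file" >&2
        echo "  expected: $expected" >&2
        echo "  actual:   $actual" >&2
        return 1
    fi
}

# Example (placeholder filename and hash):
#   verify_sha256 linux.iso "<hash copied from the distro's website>"
```

Running it twice, once with the hash as retrieved on the main machine and once with the hash as read off the second device or network, covers both halves of this step.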

At this point, randomly generate the two main passwords for the virtual machine and CA root key using KeePass and store the database on the thumbdrive(s). If using a single thumbdrive for personal use, use the master password option. Otherwise, use the encryption key option and store the password database on one thumbdrive and the encryption key for the database on another. When operating in a group, clone each password database thumbdrive to another thumbdrive (up to four drives attached to the hub at this point - Hint: Plug in and remove drives one at a time to make labeling and cloning easier). Label each password database thumbdrive appropriately. They will be given to responsible individuals at the end of the process to keep secure along with instructions on use.

Now build the virtual machine with the ISO on the first thumbdrive, disabling unnecessary features such as audio. Enable encryption on the virtual machine disk image file and use the appropriate password from KeePass. Leave networking enabled for the moment. Install the OS, making sure that files are persistent across reboots. Install OpenSSL and verify that it runs. Then power off the virtual machine.

Disable networking. Physically disconnect the host's Ethernet cable/WiFi/etc. Disable/remove the optical drive controller in the virtual machine software. Connect the thumbdrive that will hold the certificates and keys to the host. Boot the virtual machine back up. Attach the newly connected thumbdrive to the guest using the virtual machine software (if it hasn't done so already) and then mount the thumbdrive inside the guest OS:

mkdir /mnt/thumbdrive
mount -t [TYPE] /dev/[something] /mnt/thumbdrive

Now you are ready to construct your root CA. From the terminal, run the following as root:

mkdir root_ca
chmod 710 root_ca
cd root_ca
openssl req -new -newkey rsa:4096 -x509 -days 3650 -keyout ca.key.pem -out ca.cert.pem

You will be asked a series of questions. Use the appropriate password from KeePass to protect the CA private key. Common Name should be something like "[Organization] Root Certificate Authority". Email Address should be left blank. 3650 is roughly 10 years. That's the length of time until you have to go through the whole process again. 7300 days is ~20 years, 10950 is ~30 years. Shorter times are, of course, more secure but create more hassle.

Then, run:

chmod 400 *.pem
openssl x509 -noout -text -in ca.cert.pem
cat ca.cert.pem
cp ca.cert.pem /mnt/thumbdrive/

The first command makes it so only the root user can read the files - the fact that OpenSSL doesn't automatically chmod 400 everything that it generates is a security vulnerability that should be fixed. The second command dumps out the information about the certificate (be sure to verify that "CA:TRUE" appears). The 'cat' command dumps the raw PEM certificate data to the screen (a sanity check for PEM formatted data). The last command copies the signed certificate to the thumbdrive. If you ever accidentally dump or copy the 'ca.key.pem' file, then you will have to start over.
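
The "CA:TRUE" check can also be scripted rather than eyeballed, which gives a clean pass/fail line for the audit log. A small sketch; is_ca_cert is a helper name invented here and it only reads the certificate, never the key:

```shell
# Succeeds only if the certificate's text dump contains "CA:TRUE"
# (i.e. its basicConstraints extension marks it as a CA).
is_ca_cert() {
    openssl x509 -noout -text -in "$1" | grep -q 'CA:TRUE'
}

# Example:
#   is_ca_cert ca.cert.pem && echo "ca.cert.pem is marked as a CA"
```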

Now we are ready to generate an intermediate certificate which will be used to sign all other certificates. Run:

sed s/CA:FALSE/CA:TRUE/ < /etc/ssl/openssl.cnf > openssl.cnf
openssl req -config openssl.cnf -new -newkey rsa:4096 -days 1095 -keyout intermediate_01.enckey.pem -out intermediate_01.req.pem
chmod 400 *.pem

The first line alters the OpenSSL configuration so that you can generate an intermediate certificate that can be used to sign other certificates. Depending on the OS, the openssl.cnf file might not be at /etc/ssl/. The second command again asks a series of questions. Similar sorts of responses should be used. Common Name should be something like "[Organization] Intermediate Certificate Authority". Leave the challenge password and company name empty. 1095 is about 3 years. That's the length of time until the virtual machine will have to be fired up again to generate a new intermediate certificate. Note that OpenSSL ignores -days on a certificate request unless -x509 is also given, so the validity period actually takes effect when the request is signed with the root CA key:

openssl x509 -req -days 1095 -in intermediate_01.req.pem -CA ca.cert.pem -CAkey ca.key.pem -extfile openssl.cnf -extensions v3_ca -set_serial 01 -out intermediate_01.cert.pem
chmod 400 *.pem
openssl x509 -noout -text -in intermediate_01.cert.pem
cat intermediate_01.cert.pem
cp intermediate_01.cert.pem /mnt/thumbdrive/
cp intermediate_01.enckey.pem /mnt/thumbdrive/

Finally, power down the virtual machine. Disconnect all thumbdrives. The thumbdrive labeled as holding the encrypted virtual machine is moved into a physically secure location (e.g. a vault). The thumbdrive with the root CA certificate and intermediate CA files moves over to other, properly firewalled, network-connected infrastructure. Once moved over, if automation is desired and authorized, the password can be removed from the intermediate CA private key:

openssl rsa -in intermediate_01.enckey.pem -out intermediate_01.key.pem

If you need more than one intermediate certificate or are renewing the certificate, adjust the above commands to increment all the 01's to the next available number (i.e. 02, 03, 04, 05, etc).
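
For the dry run suggested earlier, the whole Root CA cert -> Intermediate cert -> Server cert chain can be rehearsed on an ordinary machine with throwaway keys. The sketch below is NOT the real ceremony: the keys are unencrypted (-nodes) and only 2048 bits to keep it quick, demo.cnf is a minimal config written just for this example, and every subject name, the v3_server section, and the server certificate step are assumptions that don't appear in the commands above.

```shell
# Dry run only: throwaway, UNENCRYPTED keys in a temp directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Minimal config so the sketch doesn't depend on the system openssl.cnf.
cat > demo.cnf <<'EOF'
[ req ]
distinguished_name = dn
prompt = no
x509_extensions = v3_ca

[ dn ]
CN = Example Root Certificate Authority

[ v3_ca ]
basicConstraints = critical,CA:TRUE

[ v3_server ]
basicConstraints = CA:FALSE
EOF

# Root CA: self-signed, ~10 years.
openssl req -config demo.cnf -new -newkey rsa:2048 -nodes -x509 -days 3650 \
    -keyout ca.key.pem -out ca.cert.pem 2>/dev/null

# Intermediate: create a request, then sign it with the root for ~3 years.
openssl req -config demo.cnf -new -newkey rsa:2048 -nodes \
    -subj "/CN=Example Intermediate Certificate Authority" \
    -keyout intermediate_01.key.pem -out intermediate_01.req.pem 2>/dev/null
openssl x509 -req -days 1095 -in intermediate_01.req.pem \
    -CA ca.cert.pem -CAkey ca.key.pem \
    -extfile demo.cnf -extensions v3_ca -set_serial 01 \
    -out intermediate_01.cert.pem 2>/dev/null

# Server cert: create a request, then sign it with the intermediate for ~1 year.
openssl req -config demo.cnf -new -newkey rsa:2048 -nodes \
    -subj "/CN=server.example.internal" \
    -keyout server.key.pem -out server.req.pem 2>/dev/null
openssl x509 -req -days 365 -in server.req.pem \
    -CA intermediate_01.cert.pem -CAkey intermediate_01.key.pem \
    -extfile demo.cnf -extensions v3_server -set_serial 01 \
    -out server.cert.pem 2>/dev/null

# Each link should verify against the one above it.
intermediate_check=$(openssl verify -CAfile ca.cert.pem intermediate_01.cert.pem)
server_check=$(openssl verify -CAfile ca.cert.pem \
    -untrusted intermediate_01.cert.pem server.cert.pem)
echo "$intermediate_check"
echo "$server_check"
```

If both printed lines end in ": OK", the chain is wired up correctly. The same 'openssl verify -CAfile ca.cert.pem intermediate_01.cert.pem' command works against the real files, since it only needs the certificates and never touches a private key. Delete the temp directory afterward.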

Four of the people present each receive one of the four thumbdrives containing the password database/decryption key after they sign for them. How this happens is up to the organization. Here is some sample verbiage: "I, ________, hereby accept one of the required components of the [Organization Name] root CA password data store. I will keep this device and the data on it in a secure location at all times. I will not copy, duplicate, or clone the data for any reason. I will not use the device or the data on the device on any machine connected to a network. If any of these terms are violated, disciplinary action may be taken against me up to and including termination. If I ever leave [Organization Name], I agree to transfer stewardship of this password data store to another [Organization Name] employee. Date/Signature"

This concludes how to correctly protect your nuclear arsenal and create a root CA on the cheap with OpenSSL.