Tuesday, February 28, 2012

Giving Privacy a Pass

As I've recently written, I am in the process of moving my backups to TimeMachine, SuperDuper and BackBlaze. One of the things that I was thinking about was the amount of time that it was going to take for the initial upload to BackBlaze. When I was looking at other solutions the initial upload was going to be on the order of months. BackBlaze, on the other hand, is said to auto-throttle.

I'm assuming that auto-throttle means that they watch the system load and then throttle the amount of bandwidth that it takes on the single system. Since I'm backing up 2 systems with individual accounts (one master account) to BackBlaze, both machines seem to be humming along nicely.

Back to my thought... in what order are the files backed up during the initial upload and thereafter? I'm certain that there have been studies on the average user's average file size and the number of edits over its lifetime. And this information would be key.

During the initial upload I'd probably sort the files by size and not date, doing all of the smallest files first. There are two reasons: (1) it shows that the backup is making progress, and the user is less likely to abandon the upload if they see quick progress; (2) in the event of a crash during the initial backup you might have a better chance of recovering more of the system in terms of individual files. The same cannot be said for the incremental backups; the largest files might get starved.
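A minimal sketch of the smallest-first ordering, assuming the backup client simply walks a directory tree. The function name `upload_order` is made up for illustration and says nothing about how BackBlaze actually schedules files.

```python
import os


def upload_order(root):
    """Return file paths under root, smallest first (ties broken by path)."""
    sized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sized.append((os.path.getsize(path), path))
            except OSError:
                # File vanished or is unreadable; skip it for this pass.
                continue
    sized.sort()  # sorts by size, then by path
    return [path for _size, path in sized]
```

For incremental backups the same ordering would keep pushing large files to the back of the queue, which is the starvation problem mentioned above.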

Anyway, BackBlaze does not appear to be taking months to achieve the online backup I was hoping for. Let's hope I never have to perform a restore... but there is something to be said for purchasing the occasional snapshot.

But now for privacy. Let's say for the sake of argument that I have elected not to encrypt my data that is being backed up. Now immediately before the backup, if BackBlaze generated some sort of signature of each target file and compared it to the entire dataset on its servers it could reduce the backup time and duplicate storage costs by consolidating duplicates. This would work well for movies, videos, music but not individual unique titles. However, at this exact moment BackBlaze is backing up my iTunes library. I know I have about 8,000+ files and BackBlaze is reporting that it has 8,000 files to back up. I do not care if anyone knows what's in my music library. It's all paid for. Seems to me that there is no reason to upload them (same for iTunes Match).
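The signature idea can be sketched as a content-addressed store. This is only an illustration: the `DedupStore` class and its fields are hypothetical, SHA-256 is my choice of signature, and nothing here reflects how BackBlaze really stores data.

```python
import hashlib


def signature(data: bytes) -> str:
    """Content signature; identical files produce identical signatures."""
    return hashlib.sha256(data).hexdigest()


class DedupStore:
    """Hypothetical server side: one stored copy per unique signature."""

    def __init__(self):
        self.blobs = {}          # signature -> file contents
        self.catalog = {}        # (user, path) -> signature
        self.bytes_uploaded = 0  # bytes that actually crossed the wire

    def backup(self, user, path, data):
        """Record the file for this user; upload only if the content is new."""
        sig = signature(data)
        uploaded = sig not in self.blobs
        if uploaded:
            self.blobs[sig] = data
            self.bytes_uploaded += len(data)
        self.catalog[(user, path)] = sig
        return uploaded
```

In a real client the signature would be computed locally and sent first, so a duplicate song never leaves the machine. That is also where the privacy trade-off comes in: the service learns which users hold the same file.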

Saturday, February 25, 2012

The ultimate backup strategy

[Update 2012.03.01] BackBlaze is reporting uploading about 6GB per day. Not bad, though it depends on network performance. It's pretty good, and it's still going to take 10 days. Still better than 9 months.

When it comes to backing up servers in a business environment MBAs usually calculate the cost of remote data centers and the cost of replicating the data and the bandwidth needed between centers versus local tape machines and offsite storage and the cost of recovery.

When it comes to individual computers traditionally businesses usually do nothing and let the users fend for themselves. Recently, however, companies have been taking steps to protect their property by encrypting the entire hard drive for laptop users and backing up the equivalent of "My Documents" and leaving the rest of the computer as-is. One other corporate solution has been virtual computers running on computers in the datacenter.

In either the enduser or server case the business has a lot to lose and so a proper disaster recovery plan is essential.

The home user has gone virtually unnoticed until recently.

In the olden days we used to back up to floppy, then Zip drives, Jaz drives and then CDROM and lately DVDROM. And at some point the cost of a tape drive was within the reach of the home user. Now data is in the TB range and the shelf life of a recorded DVD is limited. What now?

With the new cloud services like CrashPlan, DropBox, BackBlaze and such one can plan for the eventual outage... and they will happen. But when you have many hundreds of GB of data it can take months to seed one of these cloud services, as I recently commented. And each of these services offers different features that might fit different scenarios.

So what is a home user to do?

The first thing a home user should do is decide what sort of disaster he is going to mitigate. It might be a simple B&E, the 100-year flood, a software virus, or a theft from a hotel room. I definitely do not have all of the answers but this is going to work for quite a few of them.

  1. Encrypt the entire drive. It's a new feature in OSX Lion.

  2. Buy a drive that is the same size as your primary drive and use SuperDuper once a week to duplicate the primary drive. In a disaster you will be able to boot from USB or even exchange drives with little downtime. (This is harder and more expensive with SSD drives; think MacBook Air.) Make sure the target drive is encrypted.

  3. Buy a second drive which is equal to or larger than your primary drive and use OSX TimeMachine to keep incremental backups. Make certain that you encrypt the target drive. (encrypting the target drive could take days over USB).

  4. Get a free account from DropBox and sync your documents folder. This should fit in the free account but if not the cost is fairly reasonable. Install a client on your smartphone for remote access too. This is not going to give you access to your photos but you can share docs etc.

  5. Finally, get an account with BackBlaze. These guys have a very novel approach to backing up your drive. Unlike CrashPlan, which keeps incrementals and versions, BackBlaze backs up your entire drive and deletes deleted files and old versions every 30 days. This is ideal for an offsite backup that supplements the above strategies.

The strategies I've outlined above will prevent any major disaster and if you're like us with many hundreds of GB of one of a kind family memories then this is the strategy for you.

Of course if I were an attorney I'd be warning you at this point that there are no guarantees and that acts of god are not included. But this is the human equivalent of giving it your best chance to survive. Or at least your data to survive.

Friday, February 24, 2012

iPhone battery life with iCloud

I do not have any proof; however, over the last two days I have had a serious drop in battery life and the only thing I have done is turn on iCloud. Granted, I expected that when I first initiated PhotoStream because of all of the images that needed to be passed around.

However, last night I left the house with 80% battery and while I was sitting in the waiting room of a local hospital (with free wifi that requires a T&C agreement) the battery dropped to 40% within the first hour. The battery was at 20% by the second hour and the iPhone turned itself off in the 3rd hour.

Seems to me that iCloud+wifi is just draining the battery but I have no proof.  I've turned off iCloud and now let's see what happens.

GMail wins over Apple Mail : 1 - ziparoony

It does not matter how I got here but it's enough to say that I have several custom domains and I serve their email using gmail. Given that I have multiple desktops and mobile devices keeping my email clients in sync has been a challenge using Apple Mail. So I stopped. Mailplaneapp.com filled part of the void because it supported multiple active accounts. But as I discovered today the killer app was missing.

I was starting to retry Apple's iCloud again... as I mentioned (sort of). Anyway I started tweaking my email settings again and that's when I started to reconsider Apple Mail. It had been running for the better part of the day when I was about to cut the cord to my browser based email client. If I closed my gmail window I was going to lose my Google Voice client.

Sure I use Skype, Google Voice and my cell phone but there is no one device that I use regularly. And in this case if I abandoned gmail in the browser then I lose my voice. Simple as that. Winner Google.

Thursday, February 23, 2012

Why the hiring bubble is bad for everyone!

Every once in a while a company like Rackspace manages to garner some social or media prominence for having all sorts of open positions. Recently, here in south Florida, several large employers (Saveology and one of the casino chains) made the news with hundreds of open positions.

But here is the probable outcome... short term lower unemployment.

The reality is that when an employer says they have some number of open positions and it makes the news they tend to get an exponentially larger number of candidates. So right off the bat it's going to cost them more to connect with the candidate pool.

Since there are a plethora of candidates the salary pool will be "managed" meaning that they are not likely to be hiring the company's chief scientist, economist, etc... it's going to be a middle to entry level position. Face it; it's a cattle call.

So here is some advice for both the candidate and the employer.

Employer. When you are hiring via a cattle call, either make certain that you have room for the exceptional contributor in the budget or make certain that you do not hire anyone who is overqualified. And you might want to make sure that you are responsive to all of the candidates. There is no need to take 2 months to say thank you. This is the technology age; do not ask anyone to retype their resume and do not use sites like jobvite or Taleo that do.

Candidate. When you go to a cattle call, be aware that you are probably going to be passed over if you are overqualified. You should get some sense of things from the job description. So either dumb down your resume or expect to be turned down. But if you take the position and you cannot get whatever adjustments you need in 30 to 90 days... expect to be fired or feel the need to quit. So keep that resume going.

Here is a true story. I have a C-level friend and a few years ago he was interviewing for a new position. Some company offered him a position with a salary that was two-thirds of his previous salary. At this rate he would be a complete steal and very overqualified. While he was negotiating the position he said something like: if you hire me at this salary I'm going to continue looking for a job and I'm probably going to take it. There were probably some pleasantries that bookended that message so as not to sound like a total jerk but it makes the point. Anyway he started his own company and is making a living and enjoying the quality of life.

Another good friend of mine draws on extensive management training and his advice is that you have to take salary out of the equation so that employees are only concerned about the work and family and not the salary. So unless you are an entry level employee or you plan to move laterally then cattle calls are not the way to get good jobs.

just a few more things:

  • cattle calls typically mean that there is also going to be a culling at some point.

  • even for entry level positions with entry level candidates they are going to get slightly lower salaries.

  • with all this money heading out the door there is not likely to be funding for relocation.

But good luck anyway.

PhotoStream is not DropBox

It's fine and dandy that "we" tend to press the AGREE button for every EULA update that Microsoft, Apple or just about every other vendor produces. We have become accustomed to the notion that someone is reading this crap and if it's too radical or restrictive someone else is going to complain when there is a genuine conflict.

I'm still having trouble making an offsite backup of my family photo albums. My wife has taken over 20K pictures since we first started dating and I introduced her to digital photography. Add the video and we have almost 300GB and according to CrashPlan and Bitcasa my upload is going to take months even at full throttle.

Since PhotoStream storage is not counted as part of the disk usage I thought it would be a great place to store our pictures. I'd have to find an alternative for the video but for the moment this was the plan.

But when you launch iPhoto for the first time and select the PhotoStream view, the splash screen says it all. "keeps the last 30 days". Unfortunately Apple has lost yet another manual and so there is no way to know exactly what that means. And taken in the context of more "air" and iOS type computers coming from Apple I'm just not certain I know what the process is going to be in order to actually preserve the pictures.

PS: 30 days is ok if the family shares the same iCloud account and you just want to share instantly. But it's going to suck when it comes to battery life.

Wednesday, February 22, 2012

Scrum - always better to go first

I remember in (elementary, middle, high, college, university) school that I hated to go first. It did not matter what it was I just hated it. Think of it as the leadoff hitter in baseball. (which turns out to be a good place to be)

What I just realized, after being on a scrum call with my client and their entire development team, is that I want to go first. By going first there is a psychological edge that you are the one that everyone else is following. If someone says that they implemented 'X' and you also implemented 'X' it's better to be #1 instead of +1.

In today's call it appears that the code is forking and efforts are being duplicated. #1 might be tasked to continue as-is where +1 will likely be seen in a me too position and that person is likely to be reassigned.

So unless you do not like your work... be #1.

The same can be said for bad news but that's a story for another day.

Wednesday, February 15, 2012

More Management Books

I added a few book recommendations to my book list from my friend Hass.

Apps need to ask permission to use the address book

In a recent techcrunch article the author suggests that OSX should prompt the user when an app wants to access your address book. I initially thought that this was a good idea... but that did not last long. If you've used a recent version of Windows you've been prompted a hundred times a second for one permission or another.

There is no doubt that we need to have our apps sandboxed but the price is going to be a lot of user friction and could ultimately cripple OSX. I think I'd prefer something like Little Snitch.

Tuesday, February 14, 2012

Scala + Clojure

I just wrote an email to a friend that started out as a note and turned into a speech.

I spent some time on both. There is nothing unusual about either of them. (a) they still depend on the JVM (b) for whatever functional goodness they profess to have they depend on the JVM and more importantly they are still inheriting from libraries that use the standard JDK. So unless they and your app use PURE scala or clojure libraries (see that Lift uses Jetty) it's just a non-event for now.

PS: for all the benefits that "private" and "protected" were once thought to have... it's a waste of time. (a) people have access to the source. (b) people want to see the implementation details as part of "full stack awareness" (c) people need to profile and build everything from scratch (re PCI).

JITs are making their way into most languages and they are really fast. I like the performance of tcl, Lua, Python, perl and Ruby... (I think I've decided that I like Rails but I do not like the Rails movement; different things). The fact of the matter is that everything scales in all directions. Just make it easy on yourself and reduce the amount of code.

Getters and setters are STILL evil!

I'm a 3rd party reading the source code of another 3rd party for a project I'm working on and I find my head spinning because the Class that I'm currently reading is 90% getters and setters. Of course I did a little googling in order to see what the current state of accessors was and that's when I found this evil article. Looking at the byline I think it was written in 2003 and of course not much has changed since then.

What troubles me about the article is that neither it nor any of the OO gurus ever discusses anything but the simplest OO objects like point, line, square, circle. Sure, when you have essentially 3 or 5 input values it's simple to accept them in the constructor. But when you have 100 instance values of varying types then what? In the evil case getters and setters are used so that the data is validated during the set based on the rules of the class declaration. Using a single get/set does not give you that functionality and validating in the constructor makes the class unwieldy.

With so many getters and setters there is so much static code that would need to be hand coded and the only shortcut would be calling the getters and setters with some reflection. Unfortunately that has other side effects when combining meta programming and OO.

Personally I like the python approach of typeless data. Then I wrap the instance data in a hash called a dict in python. And then if I really need to validate the data it's either done when the data enters or exits the system (using a userspace data type dict) or when constrained in the data store (DB).
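A minimal sketch of validating a plain dict at the system boundary instead of through per-field setters. The field names and rules in `RULES` are invented for illustration.

```python
# Hypothetical rules: one predicate per field, checked only at the boundary.
RULES = {
    "name": lambda v: isinstance(v, str) and v != "",
    "age": lambda v: isinstance(v, int) and 0 <= v < 150,
}


def validate(record, rules=RULES):
    """Return the names of fields that are missing or fail their rule."""
    return [
        field
        for field, ok in rules.items()
        if field not in record or not ok(record[field])
    ]
```

A record with 100 fields needs 100 entries in the rules table rather than 200 accessor methods, and the data stays an ordinary dict everywhere else in the code.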

I'm just not a fan of the getter/setter. Too much code. Too little payback.

Wednesday, February 8, 2012

VOIP reporting and reconciliation

About 6 years ago a friend of mine asked me to help him out. He has a VOIP arbitrage business where he buys and sells VOIP minutes. At the time, in the course of a day he might complete 100K calls. Upon the completion of a call the "system" would generate a CDR (call data record) representing the date/time, duration, disposition, route, call source and destination. At the time he was using some legacy switches and when he compared his invoices to those of his suppliers and clients, he could not reconcile the transactions. As a result he was losing a lot of money.

The first program I wrote for him reconciled the data from the various switches.... like a forensic accountant I found the money. The issues were several fold. (a) there was a time shift in the reporting from the 3 systems. (b) there was a bug in the client's system where they failed to record the data in the correct DB shard at midnight. (the call originated yesterday but completed today) and (c) a bug in the vendor's switch failed to complete the call when the call terminated.

Fast forward 2 years and my friend was converting from his legacy switches to a newer Asterisk server. This time he was having troubles. Between system and application crashes, overall call volume, system performance, reporting performance... it was a nightmare.

So this is what I did:

  • separated the dashboard GUI from the switch. The database consumed a lot of CPU.

  • the switch generated flat files with CDRs instead of accessing a DB directly... this was a huge savings because reporting would constantly block the switch.

  • on the dashboard the CDRs were loaded into temp tables for proper ETL instead of into the target tables.

  • and I implemented the equivalent of map/reduce in the form of a rollup table.

  • and all of the CDRs were stored in monthly shard tables so that indexing and reporting performance could be optimized... also deleting a shard is easier than trying to delete a month's worth of CDRs.

  • finally I automated backups of the original CDRs. While there is some potential for data skew based on versioning of the ETL, being able to reload the data was more important.
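The shard and rollup steps above can be sketched in a few lines. This is an illustration only: the `cdr_YYYY_MM` naming, the dict-shaped CDRs and the per-day, per-route grouping are all assumptions, and the original system did this in database tables rather than in memory.

```python
from collections import defaultdict
from datetime import datetime


def shard_name(started: datetime) -> str:
    """Name of the monthly shard table a CDR belongs in (hypothetical naming)."""
    return f"cdr_{started:%Y_%m}"


def rollup(cdrs):
    """Reduce raw CDRs to call counts and total seconds per (day, route)."""
    totals = defaultdict(lambda: {"calls": 0, "seconds": 0})
    for cdr in cdrs:
        key = (cdr["started"].date(), cdr["route"])
        totals[key]["calls"] += 1
        totals[key]["seconds"] += cdr["duration"]
    return dict(totals)
```

Reports then read the small rollup instead of scanning every CDR, and dropping an old month is a single table drop instead of a large delete.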

Now that the switch could handle lots of concurrent calls with and without media... What else can go wrong? A lot actually. Before you think, "hey, convert to FreeSwitch". There is no need to go there. First of all the switches I deploy can handle close to 10K channels (2 channels per call) with media; peak. Second, my client knows how to configure an asterisk switch and so I do not have to support him other than when things go wrong; typical config bugs. Third, while a lot of people have had success with FreeSwitch in this use-case I have personally experienced some failures...

Most of the claims FreeSwitch people make are quite general about call volume and they point to the semaphore locks inside Asterisk. Personally I have not experienced that; however, I have experienced capacity limits but that was based strictly on transcoding. The fact of the matter is that most codecs use about the same bandwidth and memory between Asterisk and FreeSwitch. So if the semaphore can be rendered harmless with multiple speedy cores and lots of memory then it's practically moot and I'd rather my client was happy.

The next set of troubles are just a pain... like the pea and the princess.

  1. the reporting is losing approximately a second.

  2. we are still losing calls (missing CDRs)

Actually this is not a big deal. Asterisk and FreeSwitch report their call duration or billable seconds in seconds and they discard the fractions of a second. And if you've ever watched the movie Office Space then you know that there is money to be made by collecting the fractions. The clients and suppliers in this business are fully aware of this fact and they round the call duration in their favor so we have to make a like adjustment in our reporting to make sure that the fractions are accounted for. On average there is a full second lost in every completed call and when you complete 1M calls a day that can be some serious coin.
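A small worked example of the lost second, assuming the switch truncates the duration while the counterparty rounds up. The sample durations are invented.

```python
import math


def truncated_seconds(duration: float) -> int:
    """What the switch reports: fractions of a second are discarded."""
    return math.floor(duration)


def rounded_up_seconds(duration: float) -> int:
    """What a counterparty rounding in its own favor would bill."""
    return math.ceil(duration)


def seconds_lost(durations) -> int:
    """Total gap between the two views over a batch of calls."""
    return sum(rounded_up_seconds(d) - truncated_seconds(d) for d in durations)
```

Any call that does not end exactly on a second boundary differs by one full second between the two views, which is why the adjustment has to be made per call and not as a percentage of total minutes.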

Capturing the CDR has been implemented inside the Asterisk AGI scripting. This means that when Asterisk is processing a call through the extensions_custom.conf file there is a command to record the CDR in a text file when the call is completed or there is a hangup. However, if the operator restarts the switch (Asterisk or FreeSwitch), the hardware, or reloads the config files... then a CDR will not be generated for the calls currently in flight. Restarting when call volume is high can be disastrous to the bottom line. Currently, the only way around this is a command in Asterisk like "stop gracefully" which is supposed to delay an asterisk shutdown until after all of the calls have terminated. This of course has other side effects but at least the data is safe.

What's next?

  • Currently the CDR exporter is written in PHP and so there is some performance lag while loading the PHP interpreter. I'd like to replace it with a C or Go implementation.

  • I'd like to put the "restart" command directly in the dashboard as a function. Then capture the restart events and the current "live calls". The live calls would then be converted to CDRs while the restart is taking place.
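The second wish-list item could look something like the sketch below. Everything here is hypothetical: the shape of a live-call record, the `INTERRUPTED` disposition and the function name are placeholders, and the hard part (getting the list of live calls out of the switch) is not shown.

```python
from datetime import datetime


def cdrs_for_live_calls(live_calls, restart_at: datetime):
    """Turn calls still in flight at restart time into closed-out CDRs."""
    cdrs = []
    for call in live_calls:
        elapsed = (restart_at - call["answered"]).total_seconds()
        cdrs.append({
            "source": call["source"],
            "destination": call["destination"],
            "route": call["route"],
            "started": call["answered"],
            "duration": max(0.0, elapsed),   # never bill a negative duration
            "disposition": "INTERRUPTED",    # placeholder disposition value
        })
    return cdrs
```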

This application is by no means in the "huge" data domain but it does demonstrate some of the complexities. On the other hand there are complexities here that most "huge" data projects never encounter. I would liken the difference to the fuzziness factor. This project cannot afford any fuzziness. Fail to be accurate for just a few minutes and you could be leaking 10s of thousands of dollars. While fuzzy is ok for search results it's not ok for adwords.

Monday, February 6, 2012

Payments are pretty simple

I'm amazed that a company like Stripe is making such a big splash and, looking at the headshots, I'm further amazed that a company this young, with such a young median employee age, is still expanding. And finally if you've ever been a merchant or a merchant processor then you know that there is always a line in the sand and Stripe is not even talking about it. They have done everything they can to remove the friction from your payment process.

I have been soliciting my friends and connections in order to build an open source issuer or acquirer system. One of the reasons is because it's just outright simple. Moving a transaction from a merchant-based API to ISO8583 is pretty simple and parsing an 8583 message and processing a debit or credit transaction is even easier. The hardest part of the issuing side is the HSM which is also the most expensive component next to the software delivery. With the PCI audit, colocation, networking, and network minimums a close 3rd or 4th.
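To give a flavor of how small the parsing side is, here is a sketch that reads the message type indicator (MTI) and the primary bitmap of an ISO8583 message. It assumes an ASCII encoding with the bitmap as 16 hex characters; real networks vary (binary or BCD encodings are common), and the secondary bitmap and the data elements themselves are not handled.

```python
def parse_header(message: str):
    """Return (mti, present_fields) for an ASCII-hex encoded ISO8583 message.

    Assumes a 4-character MTI followed by a 16-hex-character primary bitmap.
    """
    mti = message[:4]
    bitmap = int(message[4:20], 16)
    # Bit 1 is the most significant bit; a set bit n means field n is present.
    fields = [n for n in range(1, 65) if (bitmap >> (64 - n)) & 1]
    return mti, fields
```

Called on `"0200" + "7220000000000000"` this returns the MTI `"0200"` and fields 2, 3, 4, 7 and 11. If bit 1 is set, a secondary bitmap follows and would need the same treatment for fields 65 to 128.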

At the end of the day it's just a simple data-in / data-out problem with a tiny bit of math and encryption.

As for how Stripe accomplished the task; that is anyone's guess. Mine, however:

  • keeping the friction down means less code and less hassles for the merchant.

  • keeping the fees down will attract more merchants

  • keeping salaries and expenses down reduces the desire to raise prices

  • at some point in the process the code starts to pay an annuity and requires less maintenance or enhancement.

  • In order to keep the cost down (KYC) they must be assuming some of the risk themselves, however, many of the (new) banking regulations must be in their favor.

I'm sure there are some insider reasons why Stripe is successful but for my money these gotta be the biggies. But if you compare Stripe to the likes of First Data Corporation it's all about the overhead or lack thereof.

This stuff is simple and as new companies enter the vertical market or define new vertical markets it's going to be about the lower barrier of entry.

I feel marketed to by Google

I guess this is why Google is getting the big bucks or at least why they got the big bucks. It certainly does not explain why their stock (a) has not changed much since late 2007 or (b) has not split. But this is off target.

I spent a few minutes today trying to make some new business cards. I was on a few different sites. I finally made a selection and input all of my information. When it came time to pay they wanted 2x what I thought I was supposed to be paying (according to the advert). So I cancelled the order.

Now I'm doing some reading when an advert popped up inside the story I was reading. It was an advert for that same card company. This time there was a coupon code. BIZ500. So I'm going to try again and see how much money they want this time.

Saturday, February 4, 2012

I hate wires

There was a time when I had a small 8 node cluster in my home. I had a bunch of monitors, keyboards and mice. Eventually I was able to cut the I/O down to two monitors, two keyboards, and two mice. But I still had so many wires running around my desk. At first everything was nicely aligned and channeled. Then after the harsh hardware upgrade (a new drive) the wiring started to unravel. In the end I moved the hardware to storage and converted my lifestyle to laptops and wifi.

I've had many years of success with a single laptop. And then there was a single failure that nearly cost me dearly. In the end I gave my wife my refurbished computer (damaged MacBook battery), bought a MacBook Unibody and then 6 months later bought a MacBook Air.

My workstation is an outboard 25" monitor with Apple's wireless bluetooth keyboard and trackpad. Both are connected to my MB-Uni. I use the MB-Air for Pandora, Twitter, Skype, some email and other ancillary apps.

But now my desk is getting cluttered again. I have a printer on my desk but I'd rather use an HP with wifi that supports my iPhone and iPad. I'd like to share a single keyboard and mouse between my MB-Uni and MB-Air. (Synergy is close but not ideal.)

So I'm struggling again to keep my desk clear. I'd pay good money for a solid KVM solution that was wireless and did not interfere. But it seems that KVM technology has not changed much in the last 10 years and the prices are still crazy expensive.

What's a wireless boy supposed to do?

Friday, February 3, 2012


In response to Why_Lua.

To say that Lua embraces the do-it-yourself approach is silly. Every programming language started this way. That includes ruby... long before Rails. But while Lua has some nice attributes it lacks one thing. A killer app or a killer tool. Just because WoW uses it as a script language is not enough. Of course if we knew the median age of most WoW script kiddies then we might be able to estimate the approximate time until Lua is a leader simply due to attrition.

For my money, I have read web site after site and I have yet to see any concise direction. Sure the language is "consistent" but the implementations are all over the place and semi-forked. Just look at Lua and LuaJIT. While the version numbers are not required to align, it would be helpful if they did.

LuaRocks has some potential as a package manager site, however, just from the outside it does not measure up to pip or gems. Some day possibly.

About the only real hope I have for Lua, in the near term, is that it's being used by Redis for scripting. And while I disagree with the lead developers it's not for me to decide and I'll be forced to learn it for that reason alone. I know the Lua team is supremely confident but if it's not ready and it gets overly popular it could implode as a product of its success. (scale taking a different form).

Good luck.

PS: I forgot one thing... I really like ZeroMQ and these benchmarks are awesome.

Where is all the shell love?

Some devs (developers) out there are showing some biased interest in zsh. I definitely do not know enough about all the different shells in order to get into a shell skipping debate but I can feel the love-loss. I have a fresh Ubuntu 11.10 installation and in the /etc/shells file you'll see:
# /etc/shells: valid login shells

I do not know why screen is in there but, ok, it is. Personally I have been using bash for as long as I can remember. I started with sh, csh and ksh but it depended on the unix I was using and what the default was. And since it was so far back people just left well enough alone. When I discovered bash it was strictly because of color, some autocompletion, but mostly command history.

Now that zsh seems to be getting somewhat of a revival it's not hard to take a look. And so I did. And I don't like it at all.

  • Part of the key success of zsh is tools like oh-my-zsh.

  • oh-my-zsh uses ruby as a systems language to script the plugins and that, of course, gets wonky when you use rvm.

  • the refcard that OMZ provided is almost 10 pages of trifold.

  • OMZ is a simple set of tools and framework but when the authors take credit for 217 contributors I'm thinking chaos... and sometimes chaos is just that.

There are certainly some takeaways from the zsh one-up-ness that I would like to see in bash but not to the extent that zsh projects.

another bad day for open source

One of the hallmarks of a good open source project is just how complicated it is to install, configure and maintain. Happily gitlab and the ...