Tuesday, August 30, 2011

perl web frameworks : part 2

After I finished writing part 1 of this article, I thought I was going to call this subject dead, as mojo was not behaving as the install documentation suggested. And if you cannot get the simplest hello world application to execute properly, then all hope of doing something interesting is lost.

As it turns out, I heard from mojo's author and all seems well. Sort of. First of all he said that I should read the perldoc (and I like perldoc, oh yes I do). I read the doc and it does correct the statement made in the install guide. I appreciate that; however, I was not given the sense that he feels the install doc is wrong. So we have net-zero points here. We also skirted the issue of the use of the term 'daemon' as a parameter to launch the app as described in the perldoc. I don't think this issue will be resolved either. I have, however, created a github ticket with the hope that these will be corrected. Finally, I'm not sure that I care what the param name is or whether it actually runs as a detached or attached process. I plan to use daemontools to control and monitor the process(es), and therefore it needs to run in the foreground anyway.

The last piece of advice that I was given was to read the FAQ. It's worth the read but there was nothing groundbreaking in there.

On to the rest of the article, which has less to do with these frameworks and a little more to do with best practices.

Perl Concurrency

PerlDancer documents LWP as a dependency, whereas Mojolicious implemented its own HTTP mechanism, modeled after LWPng, which is presumably better than LWP. (I'll leave that to someone else.) On the other hand, mojo supports libev via its IOLoop, which is a good thing; I happen to be using that structure in both tornadoweb and zeromq.
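
For the curious, here is a minimal sketch of mojo's client in action. This is my guess at idiomatic usage based on the Mojo::UserAgent perldoc, using the blocking form (there is also a callback form for non-blocking requests):

use Mojo::Base -strict;
use Mojo::UserAgent;

# Fetch a page with mojo's built-in client; no LWP anywhere in sight.
my $ua = Mojo::UserAgent->new;
say $ua->get('http://mojolicio.us')->res->body;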

[Update 2011.09.03] - I just did some reading on perl threads. It seems that threads have been native since sometime after perl 5.8.7 and that the Threads.pm module was removed after 5.10.x. I had no idea that so much had happened to perl since the last time I was in the middle of it all. Clearly they resolved some issues that Python has not. This does not really mean anything for the Mojolicious concurrency model: mojo still uses EV/libev, plus hypnotoad for "preforking non-blocking I/O". That does not mean much in terms of multithreading, since a single transaction/event can still starve a single process. Hypnotoad is used to fork the main app into multiple evented application instances. (You can infer the performance issues here.)
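
To make the evented model concrete, here is a minimal sketch of Mojo::IOLoop (assuming Mojolicious is installed; with EV present the loop should pick it up automatically). Two timers run "concurrently" in one process, which is exactly why a busy callback starves everything else:

use Mojo::Base -strict;
use Mojo::IOLoop;

# Neither timer blocks the other; both callbacks fire from the same loop.
Mojo::IOLoop->timer(1 => sub { say 'one second' });
Mojo::IOLoop->timer(2 => sub { say 'two seconds' });

# Single process, single thread: a long-running callback stalls the loop.
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;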

Performance

This is a slippery subject. There are so many benchmarks and many more ways to interpret the results. After eyeballing a few and trying to determine whether there was a predilection on the part of the tester... it seems that perl and python are close enough on the performance curve for me not to care. At least there is no low-hanging fruit, except maybe Perl 6 and pypy. So let's keep this simple.

CPAN, CPANP & CPANMINUS

I previously wrote that perldoc was perl's killer app. I think I was wrong. As I recall the good ole days of perl hacking, I'm greeted with the warm and fuzzy of all of the CPAN packages I've installed over the years. Sadly not everything installs on the first try, but there are many instances where things just work. Unfortunately I think that the CPAN is starting to show its age, and that perl does not have the following it once had, and therefore many packages are falling into disrepair.

Actor Model (aka Worker Model)

LWP, LWPng, EV, libev, etc. are only going to take your application so far. At some point the work has to be divided so that the bulk of the blocking work can a) be performed in the context of a meta-transaction or a literal transaction; and b) run at full speed for a prolonged period while the web app services or queues other requests. In this way, if you have N cores you would deploy N-1 workers, leaving one core dedicated to the misc functions. (Not that you could do any sort of affinity, but at least there might be a surplus of resources.)
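
A minimal sketch of the N-1 idea in plain perl, no framework assumed (the core count is hard-coded here; probing the hardware is left out):

use strict;
use warnings;

my $cores   = 4;          # pretend we probed the hardware
my $workers = $cores - 1; # leave one core for the web app and misc functions

for my $n (1 .. $workers) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {      # child: do the blocking work at full speed
        warn "worker $n ($$) starting\n";
        # ... pull a job, block on it, repeat ...
        exit 0;
    }
}
wait() for 1 .. $workers; # parent reaps the workers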

Message Queues - ZeroMQ

If you are going to implement an actor model, be sure to do so with an MQ. While many languages like erlang and go have built-in messaging or IPC functionality, it is not instrumented, not standardized, and certainly not cross platform. But it tends to be fast and efficient. So it depends on your design objectives; and while I hate dependencies, I do like a good MQ.

One other benefit of a good MQ is that the work can be distributed across nodes... not unlike erlang and go.
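
As a taste, here is a minimal push/pull sketch, assuming the ZMQ::FFI binding from CPAN (there are several zeromq bindings for perl; this is just one of them). In real life the pull side would be a worker in a separate process or on a separate node:

use strict;
use warnings;
use ZMQ::FFI;
use ZMQ::FFI::Constants qw(ZMQ_PUSH ZMQ_PULL);

my $ctx = ZMQ::FFI->new;

# The worker side: pulls jobs off the queue and does the slow, blocking part.
my $pull = $ctx->socket(ZMQ_PULL);
$pull->connect('tcp://127.0.0.1:5555');

# The web app side: pushes work into the queue and moves on.
my $push = $ctx->socket(ZMQ_PUSH);
$push->bind('tcp://127.0.0.1:5555');
$push->send('charge account 42');

my $job = $pull->recv;    # blocks until the job arrives
print "got: $job\n";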

NoSQL - MongoDB

I like NoSQL but I have yet to find a use case that really demonstrates its value. Sure, there is BigTable, SimpleDB and a few other implementations that are cool and interesting to study. My intuition tells me that, at that scale, the amount of data, the number of clients, and the number of servers are so out of proportion that it makes sense. But as anyone who has developed for the cloud knows, even the simplest cloud storage solution gets really expensive because you're charged for a) bandwidth, b) storage, and c) CPU. So if your modest application is going to saturate a modest hardware investment, what makes you think the big boys can avoid the same economics? I'm starting to think that the number of servers vastly outnumbers the number of requests.

NoSQL is just not a real answer for a modest website or application: a) there are no reporting tools like there are for SQL; b) while there are SMEs in the NoSQL field, they are currently in disproportion to the number of qualified SQL DBAs; c) there are plenty of ORM modules that make rapid application development easier, even though we all know that the ORM is usually replaced with hand-coded SQL/classes.

Cache - Redis

Redis is considered a NoSQL database by many, but it is also a cache engine. It has many of the k/v services and functionality that memcache has, but it will also persist to disk and replicate. There are also a number of userspace features that make it a good general purpose NoSQL database.

Redis is probably most commonly used in rails-type applications to cache DB results. This is a good use of the functionality; however, since everything needs to fit in RAM, it's only suited to caching some portion of the data as the application approaches capacity.
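
The pattern itself is dead simple. A minimal cache-aside sketch, assuming the Redis module from CPAN and a hypothetical slow_db_query() standing in for the real database call:

use strict;
use warnings;
use Redis;

my $redis = Redis->new(server => '127.0.0.1:6379');

# Stand-in for the expensive database call.
sub slow_db_query { my ($key) = @_; return "result for $key" }

sub cached_lookup {
    my ($key) = @_;
    my $val = $redis->get($key);
    return $val if defined $val;     # cache hit: skip the DB entirely
    $val = slow_db_query($key);      # cache miss: do the real work
    $redis->setex($key, 300, $val);  # keep it hot for five minutes
    return $val;
}

print cached_lookup('user:42'), "\n";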

In some web frameworks the caching is implemented as a plug-in.

Conclusion

I'm new to both mojolicious and perldancer. I'm not sure if one is better than the other; their hello world apps look very similar. perldancer takes its heritage from ruby's Sinatra and mojolicious from catalyst. Looking at their websites and documentation, neither stands out.

If I had some recommendations for Mojolicious: a) fix the doc as I recommended; it's a distraction to the noob. b) While I do not need a lesson in the fine art of installing CPAN modules, it would be nice if the cookbook were a little more complete, especially where it concerns EV and the other tightly coupled modules. Seeing as they are so closely coupled, there should be better doc. Not just pretty.

perldancer depends on the CPAN for its docs. I would have thought they would have rendered their own and added some value where they could. I suppose there is no real reason to use anything but the CPAN, especially since they depend on it so heavily. But on the other hand, there is something nice about doc that seems to belong.

If I had to decide which to try first, Mojolicious would be my choice, but only by a small margin. I like adding packages only when I have to; and I like that it is a version 2 rather than a perl port of a ruby application... meaning the design warts of the original might still be there.

Coffeescript for other uses

I wonder if coffeescript can generate, or is suited to generate, code for other target languages, and whether that makes any sense at all.

Monday, August 29, 2011

perl web frameworks : part 1

[update 2011.08.30] as pointed out, these are not micro frameworks.

I'm not sure where to begin and I'm not certain where this is going to end. I have an outline and some ideas but nothing is concrete. I'm not even sure what my conclusion is or how I'm going to get there.

What I do know is that I'm a fan of python and perl, among other languages, both compiled, interpreted, functional, byte-coded and JIT. I know that I like small and micro frameworks that don't get in the way of getting things done. And I know that while I like extensive and mature libraries, I do not like deep dependencies. And while I like terse language semantics and idioms, I tend to prefer just enough to be descriptive and functional (the other kind of functional). As I reflect on that description I visualize my ideal scenario as the sweet spot on the bell curve.

History

This past weekend I read an article where the author talked about micro web frameworks. Sadly the comparison was limited to a few python web frameworks, and he seemed to miss a few at that (cyclone and tornadoweb). However, it was a good article [slide show].

This got me to thinking: "are there any micro web frameworks for perl?" My first google search was not very good. Catalyst was in the top 10 over and over, so my "find" was not there. I kept trying. That's when these two popped up: Mojolicious and PerlDancer. I went to the mojo site and followed a link to a screencast. It reminded me of a rails demo I watched many years ago. This time, however, the presenter used a mac+vim instead of a mac+textmate. I was highly impressed, not because I like vim but because the demo showed just how well thought out this one feature was. So I'm hoping that the rest shows the same attention to detail.

PerlDancer, on the other hand, seems to follow in many of the same footsteps. And while they are micro frameworks with many similarities to their python cousins they have plenty of differences.

Just some quick notes. The author of mojo appears to be the main author of catalyst. And PerlDancer admits to being a kissing cousin to Sinatra.

Installation

To start any evaluation you need to start at the beginning; in this case, the installation. One of the first things I noticed was that they both used cpanmin.us. cpanmin.us is a url/web application used to automate the installation of CPAN libs, and both of these frameworks qualify.

Today was not a good day for cpanmin.us. First of all, their DNS servers seemed to be off the air. Once the cpanmin.us server returned, the installation went easily enough. I had been given some advice for installing the project manually; however, as much as I need to understand the complete dependency tree for this article and whatever projects I might install... this was one skirmish I was going to lose.
$ sudo sh -c "curl -L cpanmin.us | perl - Mojolicious"

After I entered my password for the sudo command, everything started to churn and mojo seemed to install nicely. Then it was PerlDancer's turn:
$ curl -L http://cpanmin.us | perl - --sudo Dancer

PerlDancer was not that lucky, and it's probably not PD's fault. After the first run I received an error message: PD could not write to the .cpanm folder, which mojo had created but with root ownership. This required that I update the ownership of the folder and rerun the command.

The one thing that is missing here is a set of commands for deploying this project in userspace while keeping the critical libs in sys-space. While I have dev servers, and most users do too, it's good practice to sandbox this sort of tool until there is widespread adoption.
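
If memory serves, cpanminus has a local-lib flag for exactly this kind of sandboxing; something like the following (untested here) should keep the libs in the user's home directory:
$ curl -L http://cpanmin.us | perl - --local-lib ~/perl5 Mojolicious

You still have to point PERL5LIB at the sandbox (local::lib can manage that for you).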

I'm not sure if PD simply generated more I/O during the install or if it just had more dependencies, but there was a lot more output from PD.

Hello World

My first hello world was implemented using mojo. It's a good thing for me that mojo had a 3-line h.w. on their website and it worked out of the box.
use Mojolicious::Lite;
get '/' => {text => 'Hello World!'};
app->start;

PerlDancer, on the other hand, simply did not work. From what I could tell, neither the command line tool dancer nor the perl module existed, and since I had already closed my desktop and terminal sessions I had lost any hope of determining whether there had been a problem during the installation. So I did what any other sysadmin would do: I re-installed it and tried h.w. again. This time it worked:
#!/usr/bin/perl
use Dancer;
get '/hello/:name' => sub {
    return "Why, hello there " . param('name');
};
dance;

It's interesting that they are so very similar when it comes to the h.w. code. I was fairly certain that if I added the shebang to the top of the mojo example it would work fine too... but then I experimented with it myself and it did not work. So one strike for and against both frameworks. Net-net, the score is tied at zero.

Hello World App

There is more than one way to create a h.w. app.
$ mojo generate app hello_world

Sadly this generated an error: mojo wanted the app name in standard perl camel case. So I tried again:
$ mojo generate app HelloWorld

which worked. Then it was a matter of launching the web app; the documentation suggested that the following would work:
$ ./script/hello_world

But when I tried to launch the application I was returned immediately to the command line with a dump of the usage string. Frustrated but not sleepy enough to give up, I decided to try the same set of steps with PerlDancer. So I created my app:
$ dancer -a MyWeb::App

And the first thing I read in response to my request was... YAML was missing. So although pd was nice enough to give me the install string, I had more manual config to do. I would have thought the config would have been complete already. So... I installed YAML and then tried again. This time the command completed OK.
$ cd MyWeb-App
$ ./bin/app.pl

Now the webserver was running. The pages loaded; maybe not with as much zip as I was hoping for, but they displayed. And as I wrapped up this session I decided to try one more thing with mojo. When I tried to launch mojo on my generated h.w. app I received this output:
rbucker@mvgw:~/tmp/hello_world$ ./script/hello_world
usage: ./script/hello_world COMMAND [OPTIONS]

Tip: CGI, FastCGI and PSGI environments can be automatically detected very
often and work without commands.

These commands are currently available:
  eval      Run code against application.
  cgi       Start application with CGI.
  inflate   Inflate embedded files to real files.
  version   Show versions of installed modules.
  daemon    Start application with HTTP 1.1 and WebSocket server.
  cpanify   Upload distribution to CPAN.
  get       Perform HTTP 1.1 request.
  fastcgi   Start application with FastCGI.
  generate  Generate files and directories from templates.
  test      Run unit tests.
  routes    Show available routes.
  psgi      Start application with PSGI.

These options are available for all commands:
  --help    Get more information on a specific command.
  --home    Path to your applications home directory, defaults to
            the value of MOJO_HOME or auto detection.
  --mode    Run mode of your application, defaults to the value of
            MOJO_MODE or development.

See './script/hello_world help COMMAND' for more information on a specific command.

Clearly I was missing something. The daemon option was the one that caught my eye first. I supposed that if I used that option the application would launch and run in the background, making it impossible to see any console output and limiting the amount of debug information I was going to receive. I tried a few other options: routes worked nicely, and I also tried psgi, cgi and eval, but it was not clear to me which was going to work. So on a whim I tried:
$ ./script/hello_world daemon

And lucky for me, it worked. So now I had more to test. I went back to my original hello.pl script and tried:
$ ./hello.pl daemon

This worked too. I finally had everything working. Of course I'm not happy about the daemon flag; it seems counterintuitive for the moment, but it was worth contacting the author about.

Well, given the amount of confusion and general inaccuracy, I have decided to break this article into 2 and possibly 3 parts depending on the response I get from the mojo author.

In conclusion, for the moment... if I had to decide the fate of these frameworks right now, it would not be good for them. There are too many alternatives that have good and sometimes great docs. Hello World is not the place where I expected the doc to fail.

In coming articles I hope to cover the following as it applies to this toolset, but only if the system continues to function. It is quite possible that I'm going to test the python micro web frameworks instead, or in addition:


Perl Concurrency
Performance
CPAN
Worker Model
ZeroMQ
MongoDB
Redis
Projects that deserve forking


 

[update 2011-09-01] The guys at mojo have made it very clear that their usage of DAEMON is the way things are going to be. I can respect that, although I disagree. There are no fewer than 3 examples in the same "framework" space that use the term consistently; only mojo has inverted the meaning. This does not mean that mojolicious is bad or worse, but it does mean buyer beware: you gotta spend more time making sure that they have not changed the context in other areas. From my perspective this is too bad; I would have liked to use it. Consider twistd, which stays in the foreground only when you ask for --nodaemon:
for testing:
/usr/bin/twistd --nodaemon --python=foobar.tac

for production:
/usr/bin/twistd --pidfile=/var/run/foobar.pid \
--logfile=/var/log/foobar.log \
--uid=nobody --gid=nobody \
--reactor=epoll \
--python=foobar.tac

QED

Get your Mojo on

This is just a quick post because I cannot contain myself. For the last few weeks I've been hacking on tornadoweb, and while I like it there is plenty of room for improvement. Minutes ago I watched a slide presentation showing off several micro web frameworks implemented in python. Shortly after that I watched a mojocast.
In a job interview with Riak's Justin Sheehy I was asked about warts in python. My single complaint is the indenting. perl has none of that and it's simply not as write-once as people suggest!

That's when I happened upon mojolicious. I have been following catalyst for a number of years, but I've come to hate the number of dependencies and the fact that sometimes they will not install. Anyway, I'm anxious to try mojo out and see what goodness it brings.

If you think about it... python seems to be in chaos. You can expect to see versions 2.5, 2.6, 2.7 and 3.x as the latest version/package depending on the distribution you are using, while perl seems more static than that.

Monday, August 22, 2011

NoSQL != NoDBA

For the reader who is not familiar, the title of this article reads: NoSQL does not equal NoDBA. And what I mean by it is that while the traditional function of the DBA is different in the NoSQL environment, one still needs a subject matter expert (SME) on the payroll in order to keep the "engine" running smoothly. NoSQL is just another specialty.

Many years ago I was caught up in SleepyCat's BDB libraries. They worked, they were fast, and, as promised, you could forgo a DBA. I developed a few proof of concept applications using BDB and they worked great. They delivered speed, big data, ACID and everything else they promised. Luckily for me, at the time, the projects never ran long enough for a disaster to occur. I know now that, at the time, I did not know enough about BDB to recover from even a moderate system failure.

Today we are inundated with NoSQL alternatives: Riak, MongoDB, Redis, Cassandra, Volt, Orient, just to name a few. To my knowledge, none of them actually state that a DBA is not required; however, they all seem to imply that your developers are going to assume the responsibility. At least Riak and MongoDB have enterprise consoles for the NOC (network operations center), suggesting that they realize otherwise.

Let's start with the schema. Most developers will knock out their first or second iteration of the schema over lunch, and in most cases it's probably pretty simple. It's not until you get into production that you realize the warts in your perfect parochial schema. I've implemented several payment systems. The first holds 12B active accounts and processes 12M sale transactions a day (333 TPS). The second had a hard time at 25 TPS. The first contained only 5 tables; the second was a beautiful 100-table constraint nightmare.

And then there is "real world" data. For example, when you're doing 12M transactions a day in Oracle, it's still a challenge to export the data so that it can be warehoused and reported upon; ETL is going to take time. That's when one might consider sharding and other approaches to optimization, even denormalization (all functions that should be performed by a DBA). However, in the NoSQL/NoDBA world, this function is going to fall on the developer... who is no longer working on new functions or revenue generating opportunities but is instead sandbagging the dam.

As far as SMEs go, they tend to know vertical markets or applications very well. They tend not to know every last detail about the data store.
For example, there was a time when my DOS-based PC would crash and I'd have to fix my hard disk. There was a time when I could and would repair the filesystem by hand; however, after Norton Utilities performed that function in a fraction of the time, I had to turn in my keys. And now, when that type of failure occurs on my Linux machine I simply reinstall. I do not have the time or the inclination to repair the data.

That function was always left to the DBA when it came to a traditional RDBMS, and to the sysadmin when the filesystem went bad. I just cannot imagine that anyone would want to perform that function when there are people who specialize in it.

So: just because you have read the docs for the client libraries, and maybe the source code, none of that makes you an SME. And there is nothing that is going to replace the SME. Just because you're not calling him/her a DBA does not mean that the function is not being performed.

Sunday, August 14, 2011

Is pypy ready for production?

[UPDATE 2011.08.15] I've been considering my own response to the question. I don't think pypy is ready; but more importantly, taking a page from the erlang book, I'm still working on getting the code correct. And by the time it's correct, either pypy or some other speedup will be available (one that does not require changes on my part).

PyPy seems to be getting very good results, not to mention performance benchmarks. The numbers are good enough not to ignore, especially with the results at quora. But is it worth taking a risk in production even though PyPy says it is alpha code?

Friday, August 12, 2011

Slicing a list into a list of variable length sub-lists

[Update 2011-08-15] Please see my update at the end of the document.

This was not much of a programming challenge, but it was a lot of fun and it only took a few minutes to decide how to implement it. The background for this task is something that, overall, I have not determined how I'm going to address. Confused?

I'm reviewing a payment system's transaction specification. The messages are not the standard ISO 8583 messages that many vendors use; they are more like a 'C' struct in that each is just a collection of concatenated strings. The individual strings, or columns, are in a predictable format; each is fixed length, but the lengths differ from each other. Consider this:
Field 1 : N   : 6 bytes
Field 2 : A/N : 4 bytes
Field 3 : N   : 12 bytes
... and so on ...

There is a useful example for "Slicing a list into a list of variable length sub-lists"; however, if you look at the examples you will see that the "step" is set to a fixed length (n=2). And if you look at the range() function you'll see that range creates a list and then the for() iterates over that list. I'm sure there is a way to create a lazy range() function, but for my example this will do.

The original code looked like:
>>> n = 2
>>> listo='1234567890'
>>> [listo[i:i+n] for i in range(0, len(listo), n)]
['12', '34', '56', '78', '90']

Here is my replacement for range()... lrange(), where the step param is a list. I suppose I should test the step type, but this will do for now.
def lrange(start, stop, step=[1]):
    if not step:
        step = [1]
    i = 0
    retval = []
    while start < stop:
        retval.append((start, start + step[i]))
        start += step[i]
        i += 1
        if i >= len(step):
            i = 0
    return retval

Here is the working example. Notice that we ran out of data before we exhausted the step list.
>>> n = [ 1, 2, 3, 4, 5]
>>> listo='1234567890'
>>> [listo[i:l] for (i,l) in lrange(0, len(listo), n)]
['1', '23', '456', '7890']

And here is another working example where we exhausted the step list before we exhausted the input.
>>> n = [ 1, 2]
>>> listo='1234567890'
>>> [listo[i:l] for (i,l) in lrange(0, len(listo), n)]
['1', '23', '4', '56', '7', '89', '0']

Finally, the code also works when listo is a list and not a string... although they are closely related.
>>> listo=[1,2,3,4,5,6,7,8,9,0]
>>> [listo[i:l] for (i,l) in lrange(0, len(listo), n)]
[[1], [2, 3], [4], [5, 6], [7], [8, 9], [0]]

[UPDATE] I hated the idea that I was creating a collection of tuples upfront. If the set were large enough, then both a) the time to pre-calculate the collection and b) the amount of storage required would hurt. Just think about lrange(0, 1000000, [2,3,5,6,8,7,5,43,23,4,67]) or maybe even lrange(0, 1000000, range(1,5)). These samples can get rough. This update for lrange uses the "generator" design pattern in Python.
def lrange(start, stop=None, step=[1]):
    # mimic range(): with a single argument, count from 0 to start
    if stop is None:
        stop = start
        start = 0
    else:
        stop = int(stop)
        start = int(start)
    if not step:
        step = [1]
    i = 0
    while start < stop:
        yield (start, start + step[i])
        start += step[i]
        i += 1
        if i >= len(step):
            i = 0

I also have another sample test, as I mentioned in the update text. Since the trailing 5 in n = [1, 2, 3, 4, 5] was never consumed, the same steps can be represented as range(1,5). What does it look like nested?
>>> listo='1234567890'
>>> [listo[i:l] for (i,l) in lrange(0, len(listo), range(1,5))]
['1', '23', '456', '7890']

Exactly the same.

Wednesday, August 10, 2011

report: How and Why We Switched from Erlang to Python

How and Why We Switched from Erlang to Python at Mixpanel Engineering http://bit.ly/pli3DY

Please read this article. The most important idea is here:
No one on our team is an Erlang expert, and we have had trouble debugging downtime and performance problems. So, we decided to rewrite it in Python, the de-facto language at Mixpanel.

Tuesday, August 9, 2011

Clojure and Scala... again

[Update 2011-08-09: I am returning from a failure to install scala and clojure. I was able to install OpenJDK from the Ubuntu packages; however, I wanted to build ant from source, and that does not appear to be possible. Ant depends on JUnit and JUnit depends on Ant, which makes it impossible to install everything from source. What's worse is that the build instructions are so much worse than I remember. There was a time when I would install all my Java tools, except the JDK and my commercial JDBC drivers, by hand.]

I just watched a video presentation from one of the Twitter geeks and he was going on and on about the JVM. What makes it interesting is that Twitter is moving from Ruby to the JVM; in fact everything is now written in Java or Scala, and there are some research projects in Clojure.

I was looking over my resume recently as I was reviewing my Java experience for a potential position. And with about 10 practical years of experience, from the first version of Java (1.02) until a few years ago... it was an uphill battle with most managers. Now "Java is the new Cobol" or "nobody ever got fired for using Java".
I find myself looking back at dynamic languages with a little more interest. Perl and I have a long history, and Python and I are getting there.

But I have some criticisms:

  • Eclipse was a wonderful project once. Now it's total chaos.

  • NetBeans is a nice platform, but with a little FUD I'm concerned about Oracle and whether they eat their own dog food. We might not be getting the best tools from them.

  • There are other Java IDEs but they are expensive or single purpose. (IntelliJ's Idea is $199 for an individual license; but it supports Groovy, Clojure, and Scala.)

  • I tried to install Scala and Clojure using packages; however, the version numbers were so far behind that it's scary. I have heard "please upgrade" too many times.


On the positive side, I have a couple of books already. I'm sure they are good enough. I'm still a little bitter about Scala's and Clojure's dependency on Java libs, but if you're a functional programmer I think that Scala and Clojure are going to be easier to sell than Erlang or Haskell... as they are incremental steps (baby steps) and they still use the JVM.

So what is the plan or roadmap?

  • deploy jars and wars

  • instrument deliverables in a consistent way

  • use event driven design with MQs

  • watch your TPS rates in terms of events

  • log in a way that makes sense

  • plan to deploy in a cluster (see MQ)

  • plan to use 3 types of databases (OLTP, OLAP, DW)

  • use one-way replication in a rollup fashion

  • Keep it simple by making sure that the unit of work is something the average programmer can pick up.

  • Stay native. If you're writing a Scala app then use Scala libs. And when you cannot, write it yourself. And when that's not possible then use a Java version; and plan to write it when you can.


Scala and Clojure are going to get another shot real soon.

PS: and so I'll probably launch a jar framework for application monitoring. Anyone want to play?

Monday, August 8, 2011

Revisit LUA

Some months ago I wanted to look at Lua so I could investigate it and see where the benefits were and so on. I had resolved to leave it alone; nothing interesting to see here. But then, recently, redis deployed a version with scripting, implemented in Lua. That, compounded with an unrelated post where an author suggested (and I posit this is the only killer feature, or feature of interest) that it's worthwhile to learn functional programming for its own sake (and nothing more), changed my mind.
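
Just so it's clear what I'm waiting for, here is a minimal sketch of the feature, assuming a scripting-enabled redis build and that the Redis module from CPAN passes the EVAL command straight through (both assumptions, given how new this is). The script body is Lua:

use strict;
use warnings;
use Redis;

my $redis = Redis->new(server => '127.0.0.1:6379');

# A tiny compare-and-set, atomic because the server runs the whole script
# as one unit: set the key to ARGV[2] only if it currently equals ARGV[1].
my $cas = q{
    if redis.call('GET', KEYS[1]) == ARGV[1] then
        return redis.call('SET', KEYS[1], ARGV[2])
    end
    return false
};
my $ok = $redis->eval($cas, 1, 'color', 'red', 'blue');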

So it is with a grin that I say I'm going to kick back and wait for the redis team to flesh out their version, and then I'll go learn it. I'm sure it's nothing complicated... I just like the design implications.

Thursday, August 4, 2011

OLTP benchmarking is hard to do

I've built a number of successful OLTP systems used in the credit card/prepaid card marketplace. One of these systems performs at around 12M transactions a day and the other around 900K. The first system has a lot more headroom: the CPU, disk and network are barely breathing. The latter, on the other hand, struggles, and over the last few years I have found myself up late thinking about it.

The 12M system is running on Sun hardware with an Oracle database backend. The application was written in C using Oracle's embedded SQL. This one application runs multiple instances on the same box as the DB, and the entire hardware/software stack is duplicated per client. This application connects directly to the internal network, where OLTP transactions are routed from the company's internal POS devices for the closed network cards and from the credit card associations for the open network cards. It also provides APIs that are called by the other applications for services like the help desk, card boarding, plastics, etc. Reporting is performed in perl and connects directly to the database.

The 900K system runs on big honking Dell PCs with a SAN to store the data and ease backups. The stack is Microsoft SQL Server, with the business logic implemented as stored procedures and the message normalization for transactions coming from the associations written in Java. The number of asynchronous socket connections with the associations can be duplicated as needed; same for the gateway hardware that processes these transactions. A transaction is sent to the database as a call into the first stored procedure, which gets the list of rules, implemented as other stored procedures, that make up this transaction. As control passes from one stored procedure to the next, the data it collects and works on is rolled into the parameter call stack in order to prevent rereads from the DB. The actual execution of the stored procedures is not bad; for that matter it was a decent implementation and it met or exceeded many of the design requirements, if I can say so myself. But it was still too slow.
I failed to mention a few details. The 900K system implemented 4-way master-master replication, so each machine was processing every transaction from every source. Just think about when batch fee-processing was running! [update] Each node was an 8-core system with 4 or 8 GB of memory.

So where did we go wrong? Well I have a checklist:

  • The 12M system only had 5 tables, the 900K system had over 100 tables in the auth system.

  • Many of the 900K tables should have been in code, either hard-coded or preloaded during startup.

  • The transactions in the 900K system were lazy: they only read from the DB when they needed data, meaning that there were more roundtrips, and in some cases there was some lock escalation.

  • Some of the indexes used btrees instead of hashes.

  • Some tables simply had too many indexes that did not apply and were never used, confusing the optimizer and just taking more time.

  • Using a document approach for an account should have improved performance overall, if the document included all of the account information and the current transaction history all in one place.

  • Logging is a killer. More logging equates to more I/O, which clearly steals large fractions of a transaction. Consider Redis: they say they can do something like 1M TPS, but if you log 100 messages into their pubsub per transaction then you are only going to get 10K TPS max. Now if you read and write to an MQ several times in a transaction then you will experience other performance-robbing events. (We did not use an MQ; however, we did a lot of logging.)

  • While modern SQL is getting better, there are all sorts of arguments for going NoSQL. This works to an extent, but it puts a different burden on the design team: you now have to implement a robust API set where you would otherwise defer to some SQL magic.


I think that covers things. I did a small proof of concept after I left the 900K company. I implemented a system without logging, using a document container for the account, using hash indexes for the tables that were important, limiting the number of tables overall, and eliminating SQL (thank you BerkeleyDB/SleepyCat). And I did all of this on very modest hardware. I managed to get 1400 TPS on a very modest CPU.
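
To give a flavor of the PoC (a sketch of the idea, not the actual code): one BDB hash, one JSON document per account, no SQL anywhere. This assumes DB_File and JSON::PP; the account number and fields are made up:

use strict;
use warnings;
use Fcntl;
use DB_File;
use JSON::PP qw(encode_json decode_json);

# One BDB hash: account number => JSON document holding everything.
tie my %accounts, 'DB_File', 'accounts.db', O_CREAT | O_RDWR, 0644, $DB_HASH;

# Boarding an account: balance and recent history live in one document.
$accounts{'4111111111111111'} = encode_json({
    balance => 5000,    # cents
    history => [],
});

# An authorization: one read, mutate in memory, one write.
my $acct = decode_json($accounts{'4111111111111111'});
$acct->{balance} -= 250;
push @{ $acct->{history} }, { type => 'auth', amount => 250 };
$accounts{'4111111111111111'} = encode_json($acct);

untie %accounts;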

Now, the things I did to get these numbers are not totally unreasonable; however, they break a lot of rules from the business point of view. Business owners like to be able to perform root-cause analysis, especially when something bad happens, so some amount of logging is inevitable. And SQL is really important for report generation, especially when the genius programmers cannot be bothered.

So there is room in my head for yet another full-blown system. If you look over at the Box Files section in the sidebar, there are some system designs that I'm putting together. I'm hoping that someone might actually pay me to develop them. Any takers?

Wednesday, August 3, 2011

LevelDB - a key/value database from Google

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

LevelDB was released a short time ago, and they were quick to provide some benchmarks along with the first release, which they say is version 1.1. Since I'm falling in love (okay, passionate strong like) with Redis more every day... why not see if LevelDB is on the level.

Well, I'm sure it is; however, there are a great number of CONS to adopting LevelDB:

  • Like ZeroMQ, LevelDB is a library and not a server. Although that could be an exercise left to the reader... as Riak has adapted to this engine.

  • It's not thread safe so that is an exercise for the reader

  • The most astonishing element of their description is that they support 3 actions: get, set, delete. While that is all you really need to get real work done, the rest is left up to the reader.

  • Since it's a library, and unlike ZeroMQ there are no client bindings for other languages, it's C++ all the way.


So compare LevelDB to Redis or any of the other NoSQL databases and you'll be truly disappointed. In today's environment you need a lot more checkboxes before you make a switch like this one.

All-time low for open source

Please follow my train of thought:

  • java is now the new COBOL

  • I used to like java when it was first released; 1.02 was probably the best release, albeit not speedy

  • There are so many libraries out there that overlap and intersect that it's like looking at JCL through a kaleidoscope, and I won't even hint at J2EE

  • There are many alternatives covering many different languages: python, perl, ruby, go, erlang, haskell, scala, clojure and so on

  • Some of them are clearly better than others... some are just plain stupid

  • I recently endeavored to design a system using python, TornadoWeb, Redis, and ZeroMQ (simple, easy, fun and productive). The framework is about 3000 LOC and the dependencies are shallow and easy to expand. An alternative to TornadoWeb might be cyclone, but it's harder to build.

  • my next potential project needs to be built in java, including enough packages to be considered J2EE-light, everything from Spring to Camel. There has to be a better way.

  • I thought to recommend SkyNet

  • It depends on doozer and go-lang

  • go-lang installs easy enough

  • doozer is crap: it builds silently, but the test program does not compile.

  • doozerd is crappier: it does not build because the libs do not match the go-libs. Why? Fixed 'em. I tried to run the test program and it crashed. Same errors.

  • roundup is worse still; there are even fewer docs here. I tried the normal build: FAIL. I tried the git version: it worked, but there are no docs for me to test it. And now it's installed... I hate that. It needs a sandbox for the install, much like go.

  • So I go back and look at the java packages.

  • they're not so bad... if the project framework were templated.

  • Yes it is [that bad]. More dependencies means more maintenance, more regression testing, more reading, harder debugging, more logging, slower transactions.

  • At the very least, if I used Grails I'd have a chance to implement a smooth and normal install path; not the chaos of the package du jour.


Meh!  I'll make it work anyway but it's not going to be as much fun.

PS: Whatever happened to COBOL?

Monday, August 1, 2011

What is the attraction to Functional Programming?

In computer science, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast to the imperative programming style, which emphasizes changes in state. -Wikipedia

I've been a programmer for 25+ years, and back in the earliest days a lot of PC-based code was implemented in x86 assembler. Later on 'C' became the de facto standard, yet we still counted CPU instructions and IO operations.

As the web has evolved and programming languages have come and gone, many of the oldies are still in play. In fact, many of the benefits of functional programming languages are limited or obscured by the non-functional OS and systems they rely on. Both scala and clojure rely on Java's JVM, and they also depend extensively on the standard old java libs, which are not functional, instead of replacing everything with native versions as one would expect.

Haskell and Erlang, on the other hand, are basically clean room implementations with few dependencies other than GCC and some libs, though at some point in the stack they still touch the OS libs. Erlang has the benefit that the libs and overall packaging seem contained and professional, where Haskell seems more chaotic and scientific. Not that one is actually better than the other... Haskell's Snap seems to be a better platform than Erlang's yaws, webmachine or nitrogen.

Recently there was a post on slashdot in which an anonymous coward indirectly suggested the smartest programmers don't work for Google. I'm sure there is truth to the statement, but the description that this person gives of his day-to-day activities is no different from what I have experienced, depending on the circumstances. In fact it takes more balls and bravado than high IQ to hack trading systems in the way he described. I dare say that the SEC might consider an investigation into this individual; while he might be smart, some day he's going to write a patch that his bank cannot cash.

What initially attracted me to Erlang was its sigma references. It seemed to me that if a phone switch was to operate at 9-sigma then my bank software should too. Now that years and several applications have passed, all written in Erlang, I have come to the conclusion that sigma alone is not a good enough justification for a business application, and that it should be a business decision as to which language one implements the business in.

I think there are two main reasons why functional programming is popular: 1) because it is different; 2) because the smart guys are making a lot of money with it.
Just because you say you are "different" does not make you different when you are standing in a crowd of people who are all different too.

In response to being "different": there is no real justification here other than curiosity. There was a time when C, Pascal, and Forth competed for mindshare. In fact P-Code almost became the foundation for an operating system/platform, but it never had the performance or mindshare. Once serious development began on *nix, 'C' took hold, although MS-DOS was still written in x86 assembler; and so were a number of desktop environments like TopView, DesqView and many others. C was low enough to let the programmers tinker and high enough to be useful for cross-platform development. And it was easier to debug than assembler.

And as far as what the smart guys are doing... it reminds me of a time when I almost went into smalltalk development. At the time, consultants were making 100-150% more than the average programmer, and a lot of good work had been accomplished. But where are these guys now? Where are their applications? Who is maintaining them?

As a side note, part of the reason Microsoft invested in .NET was that they wanted to give their platform a name; some way to bolster the OS, its development platform, and the collective ego of the .NET certified programmer. Microsoft's choice to alter java with J++ and move everything to their CLR was all about mindshare.
I recall interviewing with a company that did datamining of personal information. I liked the idea and the complexity of the problem; however, after making it through the interview process I was told that the position required that the programmer use their internal programming language. It was internal and proprietary. Needless to say, I never accepted the position.

Yahoo is primarily a PHP shop. a) to be different from the other web properties. b) to keep their employees from jumping ship.

Google is primarily a python shop.  a) to be different from the other web properties. b) to keep their employees from jumping ship.

These companies have functional applications too. But why? Is it really worth the effort? Should we really re-tool for functional programming? Consider the amount of Java code that is out there. Why would anyone rewrite it in the functional fashion? Consider all of the non-functional code that we have come to rely on. Should we really rewrite it all in a functional environment? (This is the only reason I like scala and clojure.)

This topic makes my head ache.  There is no rational justification for the attention to functional programming. It's just a game.

another bad day for open source

One of the hallmarks of a good open source project is just how complicated it is to install, configure and maintain. Happily gitlab and the ...