A Quick Note on Correlation

The correlation coefficient of 2 variables measures the strength and direction of the linear relationship between them.

Strength: Expressed visually by how “line-like” a scatter plot of the two variables appears.  Lines are thin (infinitely thin in the most abstract sense).  Numerically, strength is indicated by how close the absolute value of the correlation coefficient is to 1.

Direction: Do the variables rise and fall together, or does one variable fall as the other rises?  This is indicated by the sign of the correlation coefficient (positive or negative, respectively).
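
To make the strength and direction definitions concrete, here’s a minimal C# sketch (my own illustration, not part of the original note) of the Pearson correlation coefficient for two equal-length samples; the class and method names are illustrative only.

using System;

static class CorrelationExample
{
    // Pearson correlation: the covariance of x and y divided by the product
    // of their standard deviations.  Assumes x and y have the same length
    // and neither has zero variance.
    public static double Pearson(double[] x, double[] y)
    {
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.Length; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= x.Length;
        meanY /= y.Length;

        double covariance = 0, varianceX = 0, varianceY = 0;
        for (int i = 0; i < x.Length; i++)
        {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }

        // The sign gives the direction; the absolute value (0 to 1) gives the strength.
        return covariance / Math.Sqrt(varianceX * varianceY);
    }
}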

Converse considerations.  While a non-zero correlation coefficient implies some degree of linear dependence between the 2 variables, a correlation coefficient of 0 does not imply independence.

  • The relationship might be non-linear.  The correlation coefficient only detects linear relationships; for example, if Y = X² and X is distributed symmetrically around 0, the correlation is 0 even though Y is completely determined by X.

Why did I bother to write this note?

Given a linear relationship between 2 variables (or, put another way, a linear dependence of one variable on another), one variable might be used to predict the other.  If the variables in question represent measurements then this can be incredibly valuable, because some measurements are much harder to perform than others.  Substituting a prediction based on a cheap measurement for an expensive actual measurement can yield cost savings.  Cost might be “measured” in dollars, time or some other way, so this statistic can turn out to be valuable in many industries and disciplines.
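
As a sketch of that idea (again my own illustration, not from the original note), the C# below fits a least-squares line to paired measurements of a cheap quantity x and an expensive quantity y, then uses the fit to predict the expensive measurement from a new cheap one.  All of the names here are hypothetical.

static class CheapMeasurementPredictor
{
    // Fit y ≈ intercept + slope * x by ordinary least squares.
    public static void Fit(double[] x, double[] y, out double slope, out double intercept)
    {
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.Length; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= x.Length;
        meanY /= y.Length;

        double covariance = 0, varianceX = 0;
        for (int i = 0; i < x.Length; i++)
        {
            covariance += (x[i] - meanX) * (y[i] - meanY);
            varianceX += (x[i] - meanX) * (x[i] - meanX);
        }

        slope = covariance / varianceX;
        intercept = meanY - slope * meanX;
    }

    // Substitute a prediction based on the cheap measurement for the expensive one.
    public static double Predict(double cheapMeasurement, double slope, double intercept)
    {
        return intercept + slope * cheapMeasurement;
    }
}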

Running Fedora 11 in Microsoft Virtual PC 2007

Aside from a few quirks during the install (e.g., the partitioning tool kept trying to format the install partition, which made the installation crash immediately after it succeeded), I’ve been pretty impressed with just how easy Fedora is to use.

For the most part the System menu looks like a clone of Windows’ Control Panel (for XP).  It even has an “Add/Remove Programs” application though, to be fair, that’s a pretty generic term.

To get the window manager to display properly I keep having to manually add the “vga=791” boot parameter at boot.  So I figured I’d make the change permanent using the handy “Bootloader” application that’s in Fedora’s System menu.

It starts, asks me for the root password, then promptly crashes!  On the plus side, at least it crashed with a helpful stack trace complaining about a missing module called “kudzu”.  I’d never heard of kudzu, but the GUI package manager (“Add/Remove Software”) was able to find it about 10 minutes after it started updating its list of software.

Alas, it was not to be.  I tried starting the Bootloader app again but was never greeted with a user interface for editing the bootloader.  Ah well, grub uses a menu.lst file for most configuration, so I’ll have to edit that manually.
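
For anyone in the same boat, the manual edit amounts to appending vga=791 to the kernel line of the Fedora entry in /boot/grub/menu.lst.  The entry below is only a rough sketch of what a grub-legacy entry looks like; the kernel version, root device and other options are placeholders, so don’t copy them verbatim:

title Fedora (2.6.xx.x-xxx.fc11.i586)
        root (hd0,0)
        kernel /vmlinuz-2.6.xx.x-xxx.fc11.i586 ro root=/dev/VolGroup00/LogVol00 rhgb quiet vga=791
        initrd /initrd-2.6.xx.x-xxx.fc11.i586.img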

The saga continues…

Visual Studio Efficiency Tip

While working on code it’s sometimes helpful to be able to look at more than one section of code at a time. 

For example, when you’re creating a method that’s similar to an already existing method it helps to be able to see both the old and new methods at the same time. 

Another example is when you’re moving code from one method to another.  In this case, seeing both code sections cuts down on mistakes (less worry about things like “where did I just cut that block of code from…”).

One of my favorite ways to deal with this is to split the code editor window.  I stumbled on it while using Word to edit a document, tried it in Visual Studio and presto! It worked!

Being a keyboard shortcut-o-phile, I’ve set up a shortcut for jumping from one part of the split window to the other (Tools –> Options –> Environment –> Keyboard: Window.NextSplitPane => CTRL+W, CTRL+N).  Since the location of the splitter varies depending on which part of the code needs more space, I tend to establish the split with the mouse.

[Image: where to click to split the code editor – the splitter handle above the scrollbar, circled in green]

To split the window with the mouse, move the cursor to the area circled in green in the image above (above the scrollbar), then click-drag the splitter.  The result, shown below, will be an independently scrollable editor pane.

[Image: the code editor after splitting, with two independently scrollable panes]

Trees in SQL Databases

The Tree is a widely used data structure.  It’s found all over the place, from industry (e.g., organizational charts are usually trees) to genealogy (e.g., family trees) to biology (e.g., phylogenetic trees).

Trees have been heavily studied by Computer Scientists since data stored in trees can often be located very quickly (relative to the total amount of data).  Trees can also be used as a design tool to identify highly parallelizable components of an algorithm.

SQL databases are designed around the entity-relationship model.  The benefits of SQL are many, but if we’re to reap those benefits we’ve got to find a way to get the world’s data into a form congenial to that model.

Converting an existing data model into relational form usually entails identifying the entities and their relationships.  These are then mapped to tables, keys and foreign keys.

Trees are a little bit different in that they are typically modeled as an entity with a relationship to itself.

[Image: the FamilyMembers table, with SSN as the primary key and a nullable ParentSSN column]

The nodes of a tree exhibit a 1-to-many relationship.  Each child has a single parent; each parent can have 0 or more children.  In the relational model each node is represented as a row.  The parent of a node, if any, is stored in the ParentSSN field.  The SSN is the primary key for the table so each row is required to have one.

The ParentSSN field allows nulls because not every node has a parent (the root node, for example, has no parent).
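
To make the self-referencing shape concrete, here’s a small C# sketch of my own (the class, field and method names are illustrative, not from the post) that rebuilds the tree in memory from rows shaped like the FamilyMembers table:

using System.Collections.Generic;
using System.Linq;

// Illustrative shape of one row from the FamilyMembers table.
class FamilyMember
{
    public string SSN;                     // primary key
    public string ParentSSN;               // null for the root
    public List<FamilyMember> Children = new List<FamilyMember>();
}

static class FamilyTreeBuilder
{
    // Index the rows by SSN, then attach each node to its parent.
    // Rows whose ParentSSN is null become roots.
    public static List<FamilyMember> Build(List<FamilyMember> rows)
    {
        Dictionary<string, FamilyMember> bySsn = rows.ToDictionary(r => r.SSN);
        List<FamilyMember> roots = new List<FamilyMember>();

        foreach (FamilyMember row in rows)
        {
            if (row.ParentSSN == null)
                roots.Add(row);
            else
                bySsn[row.ParentSSN].Children.Add(row);
        }

        return roots;
    }
}

Each node appears exactly once in the dictionary, so rebuilding the whole tree is a single pass over the rows.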

While trees can be represented in SQL databases in this way, the hierarchical nature of the tree makes querying it via SQL somewhat more difficult than querying other kinds of data.

Oracle supports hierarchical queries via the CONNECT BY keyword.  SQL Server supports them via recursive common table expressions, and SQL Server 2008 added a dedicated hierarchyid data type.

LINQing away boilerplate collection transformations

Forcing a list of strings to upper case before LINQ:

List<string> list = new List<string> { "element1", "element2", "element3" };
List<string> listCopy = new List<string>();

foreach (string listElement in list)
{
    listCopy.Add(listElement.ToUpper());
}

foreach (string upperCaseListElement in listCopy)
{
    // ...
}

Since LINQ supports transforming sequences inline, this can be expressed much more succinctly:

foreach (string upperCaseElement in list.Select(el => el.ToUpper()))
{
    // ...
}

This expressiveness is enabled by C# 3.0’s support for lambda expressions and the standard query operators.
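
For comparison (this variant isn’t in the original post), the same transformation written with C# query syntax compiles down to the same Select call:

// Continues the example above; the query needs a using System.Linq; directive in scope.
var upperCased = from el in list select el.ToUpper();

foreach (string upperCaseElement in upperCased)
{
    // ...
}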

Keeping Music on Your Desktop and Your Laptop No Matter Which One You Used to Buy It

I have to take note of an excellent utility buried in the Windows Live software stack for synchronizing folders.  It’s discussed in a SuperUser.com posting (Keeping folders synced between several machines - Super User).

Synchronization (or sync for short) means different things to different people so some clarification is in order.  Like many people I have multiple computers at home.  Sometimes I’ll buy music (mostly from amazon.com/mp3 these days) while using my laptop.  Sometimes I’ll buy it while using my desktop.

Unfortunately, Vista doesn’t come with a built-in way to automatically keep these 2 music folders in sync.  It’s got plenty of features grouped under the rubric of synchronization, but none of them let you automatically keep two already existing folders in sync.

This problem is probably so common that I’ll give it a name: “Playlist Frustration Syndrome” (PFS).  The main symptom is awkward pauses as your desktop computer unsuccessfully tries to play music you bought while using your laptop.  Other symptoms include PC Rage, spontaneous swearing, manual copying and, lastly, disillusionment with the whole concept of multimedia convergence.

Fear not: Windows Live Sync (formerly FolderShare) is the cure for this and other ills.  I haven’t explored those other ills, so I won’t comment on them in this post.

Windows Live Sync supports 2 types of sharing; for keeping folders on a LAN in sync, use the “create personal folders” option.  A personal folder provides a way to map folder A on computer A to any other computer + folder combination.

In my case, for music this means mapping:

DESKTOP\music to LAPTOP\music

Since it uses peer-to-peer networking, no information needs to be sent over the internet in this scenario.

They even wisely used Universal Plug-and-Play, so if you have a router that supports it (and it’s turned on) then any firewall settings are automatically taken care of.

This was definitely a smart acquisition on Microsoft’s part.

Software vs Terrorism

Ran across an article about software that’s improving intelligence analysis and has already saved lives.  Although interesting for many reasons I think it highlights the significance of a few trends in software development:

  1. Building software is an iterative process.  To get it right you need frequent input from the end-users.  This is a key distinction between agile methods and classic waterfall design.  According to the article, “Every other week for about two years, the engineers returned to Washington with a revised product, based on analysts’ requests.”
  2. Rapid application development.  You can’t revise a product every 2 weeks unless you’ve got tools capable of supporting that level of productivity.  I would be very surprised if the company’s flagship product didn’t have large components written in one of the more productive languages (e.g., Java, C#, Visual Basic, etc…).
  3. Software Engineers have to be able to work with end-users and subject matter experts to create better software.
  4. The user interface is important.  A key feature of the software is something that anyone who uses Google takes for granted: the ability to search several databases conveniently.
  5. Lightweight categorization (aka tagging) can produce better results than up-front, deep hierarchies.  Tagging is easy for users to do (so they are more likely to do it) and flexibly accommodates changing categorization needs.  Instead of having a team of experts build a deep semantic hierarchy that’s out of date within a week, tagging lends itself to a bottom-up approach.  All the end user has to do is tag the information using whatever terminology makes sense to them.  Software is then used to identify clusters in the data.

I’m sure there are other lessons to be learned from the success of Palantir.  It’s great to see software developed using modern methodologies put to such positive use in such a short time.

Running Fedora 11 in Microsoft Virtual PC 2007 SP1

There are some wonderful open source network monitoring tools out there.  A few that I want to try only run in a unix/linux environment.  I don’t have any spare hardware nearby so I decided to install linux in a virtual machine.

Getting Fedora (RedHat’s free linux distribution) to install and run in a Microsoft Virtual PC 2007 SP1 Virtual Machine was somewhat less than painless.  On a Core 2 Duo Laptop it required the following steps:

  • Disable hardware virtualization in the BIOS.  Not sure why; Debian seemed to work with it enabled.
  • Append noapic nolapic noreplace-paravirt to the options passed to the kernel at boot.  This was done by pressing tab at boot and manually adding the options to the end of the kernel boot line.  This prevents several “unrecoverable error system will restart” and “kernel panic” messages on boot.
  • After install but before running the installed OS, append vga=791 to avoid unreadable/pixelated graphics.

Most of these steps are outlined in this very helpful post at Kartones Blog (though he was working with Debian).

Anyone who’s had to go through this pain gets asked, “Why didn’t you use {VMWare, VirtualBox, Xen, etc…}?”  Virtual PC 2007 is free and was already on the local network.

The Virtues of Virtualization

Back in the bad old days, say 5+ years ago, when an enthusiast user wanted to run more than one operating system, he/she would run a multi-boot setup. That is, the computer would boot into a boot manager that allowed the user to choose which operating system to run.

Although this worked it was kind of a pain to maintain. Recovery was dangerous as one operating system tended to overwrite the boot setup of any other operating system. It also, usually, required separate disk partitions. Deciding how much space to allocate for each operating system was a black art that always seemed to leave lots of space unusable because it had been allocated to one of the other operating systems.

Over time setup and maintenance became easier as operating systems increased their support for multiboot scenarios. Linux users are laughing at that statement since linux has always existed in a heterogeneous environment and has pretty much always had support for multiboot (via various boot loaders like grub or lilo). Windows may have been a little later to the party but boot.ini has been around since NT.

Even with improved OS support for playing nice with multiple boot scenarios there's something fundamentally unsatisfying about multiboot. One of the most common reasons to run more than one operating system on a single computer is to be able to test software on a different OS. A similar task is to be able to test a web app in multiple browsers on different OSes. Yet under multiboot each "test" requires a machine reboot. This is an incredibly inefficient way to test. Because it's so inefficient it inhibits the frequent "make small change-test change-repeat" cycle that is key to reducing bugs.

Enter virtualization. Instead of having a single computer run a single operating system at a time, virtualization allows a single computer to run multiple operating systems simultaneously. The first operating system that boots is called the host operating system. Subsequently run operating systems are called guest OSes.

Virtualization allows a single computer to run multiple operating systems simultaneously through the magic of software. The key piece of software is the virtual machine. It is exactly what it sounds like. It is a piece of software whose sole purpose is to fool an operating system into believing it is running on real hardware.

A non-virtual machine, e.g., a computer you might buy from Dell, has a CPU, chipset, memory, a hard disk and peripherals. All of these are physical devices. A virtual machine has the virtual equivalents of these; a virtual chipset, virtual video card, virtual hard disk, etc...

There are several vendors that make virtualization software. Each provides their own flavor of virtual hardware. That is, VMWare provides a different virtual video card than Microsoft's Virtual PC, which is different from the one provided by Sun's VirtualBox.

Virtualization is similar but not identical to emulation. Emulation has been around for years. You can take a quad core, 64bit machine and run software to emulate a Commodore 64. I'm not quite sure why people do this but am told it's all the rage in PC nostalgia. The difference between emulation and virtualization is that virtualization depends on a tighter integration between the virtual machine and the actual machine. An emulator runs in a process like any other; the emulation software runs in user mode, its threads are scheduled like any other user thread.

When a virtual machine is running an operating system, it is executing in the same privileged mode (with a few exceptions) as the host operating system. Programs running in the guest operating system can be pre-empted by the guest operating system. This improves performance and contributes to an illusion that also distinguishes a virtual machine from emulator software: the guest operating system need not be aware that it is running in a virtual machine.

Like physical machines, a virtual machine must have an operating system installed on it to profitably take advantage of the system. Unlike physical machines, the virtual machine doesn't need a physical hard disk; it uses a virtual hard disk created by the virtualization software. The flip side of this is that virtual machines can only use virtualized hardware. Most virtualization software provides support for virtualizing network adapters, USB, printers, serial ports and a few other commonly used peripherals. If the virtualization software doesn't provide a way to virtualize a given piece of hardware then that hardware probably can't be used in a virtual machine.