Sunday, April 19, 2009

Data Typing

Data comes in all forms. Names, ages, birth dates, addresses, temperatures, day of the week, time, price, pressure, categories (such as M/F for gender), scores, zip codes, rankings ("on a scale of 1-10, how much do you like X?"), flow rates, keywords in news, size of a package, tracking number, lot number, and on and on and on.

Sometimes data is a unique identifier for other data, as when information is tracked by Employee Number, Social Security Number, Date/Time, sequence number, lot number, etc. In these cases, the data is an identifier for a row or block of data. It could be in almost any form and can have sub-meanings as well (employee numbers or lot numbers may be sequential... or maybe not).

Generally, when processing data, it is good policy to keep the data together. That is to say, if you are computing something based on a person's age and zip code, it is smart not to let these get separated from the person's identifier, just as a stock's price should not get separated from its date. The data might look like this:

Name,Age,Zip Code,Yearly Salary
Tom Jones,38,55439,34504.44
Betty Mabel,23,97404,45505.00
...

Name is an identifier, but not necessarily a unique one, and it could also serve as a gender proxy (most of the time, but not always... Sam... man or woman?). Age is clearly a number (or maybe age is really a date difference: years since birth, truncated). Zip code is numeric, but not really a scalar number. Salary is a floating point or "money" type. So even at this stage we have a hard time telling the purpose, use and meaning of the data. It seems so obvious, but really it is not. It depends on what the data is and how we intend to use it.
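To make the ambiguity concrete, here is a minimal sketch in Python (chosen only for brevity; it is not Intellect code). The zip code "02134", the birth dates, and the helper name age_in_years are all made up for illustration. It shows a zip code losing information when treated as a number, and age computed as a truncated date difference.

```python
from datetime import date

# Hypothetical zip code with a leading zero.
zip_text = "02134"

# Treated as a scalar number, the leading zero is lost.
zip_as_number = int(zip_text)

# Treated as a label (category), the text is preserved exactly.
zip_as_label = zip_text


def age_in_years(born, today):
    """Age as a date difference: whole years since birth, truncated."""
    years = today.year - born.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (born.month, born.day):
        years -= 1
    return years
```

The same field gives different answers depending on the meaning assigned to it, which is why the choice of type cannot really be made before knowing the use.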

Let's say we wanted to pass along an array of this data for processing by various tasks. We would need to "cast" the data array to a type (string, integer, long, money, date, ...). Hmm... things get sticky, because most arrays are of a single type. The data might have to be split into multiple arrays of different types, but the right type depends on the meaning assigned to the data at that time (and on the presumptions of the person doing the data typing). So "Tom" might be stuck into a string array, while his age might be put into an array of what... ints? doubles? floats? decimals? monies? Hmm... Just create an array of "object", you say? Possible, but in many cases Microsoft will "type" the object with the first assignment, so no, it might get messed up. In just those 4 very common variables, there are many interpretations and data types that could be used!

So... in Intellect 3.0, we take the easy, but very effective, way out. In those cases where we are sending arrays of data around the system amongst data processing tasks, we keep the data as a string array. Each Intellect 3.0 task converts the data into the type suitable for what that task is doing, whether that be decimal, double, float, int, category, enumeration, or date, or it keeps the data as a string, or does whatever makes sense at that moment. This allows us to keep "Tom" in the same array with his age, zip and income, and to keep stock prices with their dates and times.
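The idea can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not Intellect 3.0's actual .NET code: the sample rows are modeled on the table earlier in this post, and the task names (average_age, total_salary, names_by_zip) are hypothetical. Every row stays together as strings, and each task converts only the column it needs, into the type it needs.

```python
from decimal import Decimal

# Each row stays together as a list of strings (hypothetical sample data).
# Columns: name, age, zip code, yearly salary.
rows = [
    ["Tom Jones", "38", "55439", "34504.44"],
    ["Betty Mabel", "23", "97404", "45505.00"],
]


def average_age(rows):
    """A task that needs age as a number converts it to int on the spot."""
    ages = [int(row[1].strip()) for row in rows]
    return sum(ages) / len(ages)


def total_salary(rows):
    """A task that needs money converts salary to an exact Decimal."""
    return sum((Decimal(row[3].strip()) for row in rows), Decimal("0"))


def names_by_zip(rows):
    """A task that treats zip code as a category leaves it as a string."""
    groups = {}
    for row in rows:
        groups.setdefault(row[2].strip(), []).append(row[0].strip())
    return groups
```

Note that the three tasks read the same rows and never need to agree on a single "correct" type for any column.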

Downside: Lots of conversions going on from string to a multitude of data types.

Upside: We keep data together and the data is properly used, in different ways, for different purposes at any time in the process. We can be flexible.

Blessing: In .NET, Microsoft has done a super job with string processing, conversion, and memory allocation, which makes this lightning fast, even on millions of rows. In Visual Studio 6, if you even had a 50 column by 150,000 row string array in RAM, the application would likely crash. No problem in .NET.

Now, passing array data around as strings is not required in Intellect 3.0. Tasks are actually shuttling messages containing objects of a declared type, but that is a topic for another post on another day.
