mercredi 9 décembre 2009

Transforming column datafiles into line datafiles on Windows




In the field of computer science, you frequently need to visualize data: it always makes things clearer when you show a chart, the reader gets the picture just by looking at it, without having to read the painfull 15-lines paragraph below.

So you have data. Sometimes lots of data. Sometimes badly organised text data files. For instance, where successive values are not on successive lines, but on successive columns. This might not be clear for everybody, so here is an example. Say you have a file containing temperatures measured every hour, for some period of time (a month, for instance). The usual way of doing is one measure per line:

1/1/2009;0;22
1/1/2009;1;23
1/1/2009;2;22.5
2/1/2009;0;24
...
and so on. And sometimes you run across datafiles where the layout is:

1/1/2009;22;23;22.5
2/1/2009;24; ...
...
While regular people won't find anything bad about this (it does indeed save some disk space!), this type of layout is actually unlogical: columns are supposed to be the different fields of data, not successive values. And it makes things more complicated when it comes to plotting...

I ran across this issue when trying to illustrate my previous post on gnuplot with a nice figure. I wanted to have a fancy "real-world data" illustration, so I downloaded electricity daily consumption datafiles from RTE (you can get those here). They provide Excel files per year, with 365 lines, one per day, and the power consumption for every half-hour (48 values per day). And guess what, the layout is just as described here...

So, first, before writing an adequate gnuplot script file, you need to transform columns into lines. And this is what this post is about, for Windows users. It can also be considered as a demo of what you can do with the Windows command-line interpreter.

A quick search shows some interesting material, mostly based on Linux tools (see here for example). And yes, Windows users, you'll need to get some new software, because Windows lacks some basic tools. At present we will only need the 'cut' tool, a binary can be downloaded through the gnuwin32 coreutils package.

What we need to do here is to process each line of the file, cut it into fields, and write a one datum per line output datafile. So, lets go first for the line-by-line processing, using the for command (you have of course already converted the file to .csv format):

for /F "delims=." %%a in (%file%) do call :sp1 "%%a"
Of course, you will have previously put the file name into the 'file' variable with set file=myfile.csv. The "delims" item is there just to get the whole line of data.

Then, we need to produce one output file for every column that contains a data value. This is done with the "numeric" version of the 'for' command:

--------------------------------------------
:sp1
set line=%~1
echo %line% >line.txt
for /L %%b in (1,1,48) do call :sp2 %%b
goto :eof
--------------------------------------------
Finally, cut the same line 48 times, and add each of the columns to the output files (column 1 is the date, column 2 is some non-significant data, the first value is in column 3):

--------------------------------------------
:sp2
set /A col=2+%1
"%app%" -d; -f1,%col% line.txt >> all.dat
goto :eof
--------------------------------------------
with the variable 'app' containing "c:\program files\gnuwin32\bin\cut.exe"

And that's about it, the output file all.dat will now contain one line per data point. The only thing left is to leave a line between each day, so gnuplot can figure out where the day stops, so sp1 is actually more like:

...
for /L %%b in (1,1,48) do call :sp2 %%b
:: gnuplot needs 1 blank lines to separate records
echo. >> all.dat
goto :eof
--------------------------------------------
Finally, we can plot the thing with some classical gnuplot scripting:

------------------------------------------------------
set title "France power consumption on mondays, 2008\n(data source : RTE)"
set xrange [0:47]
set xlabel "day period"
set ylabel "week"
set yrange [0:51]
set zlabel "MW" offset 0,3
set ztics 30000,10000
set datafile separator ";"
set style data lines
set grid
set pm3d
set surface
set hidden3d
fn="all.dat"
set view 42.0,56.0
unset colorbox
splot fn using 2 every 1:7 notitle
pause -1
set terminal png size 640,480
set output "RTE_2008_monday.png"
replot
------------------------------------------------------
Feel free to comment (english or french) if you'd like more details.


Aucun commentaire:

Enregistrer un commentaire