Creating and editing Madrigal data files

A key element in administering the Madrigal database is the ability to create and edit Madrigal data files. An ambitious Madrigal administrator could read the CEDAR Madrigal Hdf5 format description and write their own code from scratch. However, Madrigal provides APIs and examples in two languages, Python and Matlab, that make this task much easier. This section describes how to create Madrigal files using each of those two languages.

Python

How have things changed with Madrigal 3?

For the most part, Python scripts that created Madrigal 2 files need only slight modification to create Madrigal 3 files. The first change is that, with Madrigal 3, the independent parameters (excluding time) must be explicitly declared. This is because Madrigal 3 files contain both a table layout and (if there are independent parameters) a grid layout, with the number of dimensions = 1 + (number of independent parameters). The first dimension is always time. The independent parameters are declared in the MadrigalDataRecord init method.
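To make the dimension rule concrete, the following sketch arranges table-layout rows into a grid whose first axis is time and whose second axis is the single independent parameter gdalt. It is plain Python and does not use the Madrigal API; the row format and the function name table_to_grid are assumptions made only for illustration.

```python
# Illustrative only: plain Python, not part of the madrigal package.
# Each row mimics a table layout entry: (record time in seconds, gdalt, tr).
rows = [
    (0, 100.0, 1.0), (0, 200.0, 1.0), (0, 300.0, 2.3),
    (60, 100.0, 1.0), (60, 200.0, 1.7), (60, 300.0, 2.4),
]

def table_to_grid(rows):
    """Return (times, alts, grid) where grid[i][j] holds the value measured
    at times[i] and alts[j]; cells with no measurement stay None."""
    times = sorted({r[0] for r in rows})
    alts = sorted({r[1] for r in rows})
    grid = [[None] * len(alts) for _ in times]
    for t, alt, value in rows:
        grid[times.index(t)][alts.index(alt)] = value
    return times, alts, grid

times, alts, grid = table_to_grid(rows)
# One independent parameter (gdalt) gives 1 + 1 = 2 dimensions: time x gdalt.
print(len(grid), len(grid[0]))
```

With two independent parameters (for example gdlat and glon) the same idea would yield a three-dimensional grid.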

The second small change is the addition of an optional argument arraySplitParms. If one or more array splitting parameters are given, then multiple grids of data will be created, one grid for each unique combination of values of the array splitting parameters. The idea behind this argument is to make the gridded data less sparse and more useful to the end user. For example, a phased array incoherent scatter radar such as PFISR that simultaneously measures along different beams would split its arrays by the parameter beam_id. This would make the gridded data both dense and more user-friendly, since a user could simply open the data for the beam of interest.
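The effect of array splitting on sparseness can be sketched in plain Python. The example below does not call the Madrigal API; the row format and the helper fill_fraction are hypothetical. It assumes two beams that each sample their own set of ranges, and compares how full a single grid is with how full the per-beam grids are.

```python
from collections import defaultdict

# Illustrative only: plain Python, not part of the madrigal package.
# Each row: (record number, beam_id, range, value).
# Records alternate between two beams, and each beam samples its own ranges.
rows = [
    (0, 1, 100.0, 5.0), (0, 1, 200.0, 6.0),
    (1, 2, 150.0, 7.0), (1, 2, 250.0, 8.0),
    (2, 1, 100.0, 5.5), (2, 1, 200.0, 6.5),
    (3, 2, 150.0, 7.5), (3, 2, 250.0, 8.5),
]

def fill_fraction(rows):
    """Fraction of (record, range) grid cells that hold a measurement."""
    records = {r[0] for r in rows}
    ranges = {r[2] for r in rows}
    return len(rows) / (len(records) * len(ranges))

# Group the rows by the array splitting parameter.
by_beam = defaultdict(list)
for row in rows:
    by_beam[row[1]].append(row)

print('single grid:', fill_fraction(rows))
for beam_id, beam_rows in sorted(by_beam.items()):
    print('beam', beam_id, ':', fill_fraction(beam_rows))
```

In this toy data set the single grid is only half full, while each per-beam grid is completely filled.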

Instructions

This section gives an introduction to using the Madrigal Python API to create new Cedar Madrigal Hdf5 files, and to edit existing Cedar Madrigal Hdf5 files. Examples are given of creating new normal-sized files, creating new large files, and editing existing files. Users creating Cedar Madrigal Hdf5 files with Python can choose between two slightly different patterns: one that maximizes speed but could have memory issues with very large files, and another that limits memory use with very large files but is somewhat slower. Complete documentation is available as part of the Madrigal Python API documentation.

The python cedar module was written to simplify the task of creating Cedar Madrigal Hdf5 files as much as possible. The user of this module only needs to know the following about the Cedar format:

The high level object in the cedar module is MadrigalCedarFile. This class emulates a python list, and so users may treat it just like a python list. The restriction enforced is that all items in the list must be either MadrigalCatalogRecords, MadrigalHeaderRecords, or MadrigalDataRecords. Each of these three classes supports the method getType(), which returns 'catalog', 'header', and 'data', respectively.
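Because the file object behaves like a list and every record reports its type, ordinary list idioms apply. The sketch below counts records by type. The three Fake classes are hypothetical stand-ins so that the example runs without a Madrigal installation, and a plain list stands in for a MadrigalCedarFile; only the getType() method and its three return values are taken from the description above.

```python
from collections import Counter

# Hypothetical stand-in record classes, for illustration only. The real
# classes live in madrigal.cedar and carry far more state.
class FakeCatalogRecord:
    def getType(self):
        return 'catalog'

class FakeHeaderRecord:
    def getType(self):
        return 'header'

class FakeDataRecord:
    def getType(self):
        return 'data'

def count_record_types(cedar_file):
    """Count records by type in any list-like object whose items
    support getType()."""
    return Counter(record.getType() for record in cedar_file)

# A plain list stands in for a MadrigalCedarFile here.
records = [FakeCatalogRecord(), FakeHeaderRecord(),
           FakeDataRecord(), FakeDataRecord()]
counts = count_record_types(records)
print(counts['catalog'], counts['header'], counts['data'])
```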

Cedar catalog and header records can be difficult to create. With the cedar module CatalogHeaderCreator, creating these records should now be much easier. Catalog and header records contain a lot of information that can be deduced from the data, including which parameters are present, the minimums and maximums of certain parameters, and the start and stop times. The only parts of the catalog or header record that can't be determined automatically are some optional descriptive text fields. With this new module, you simply pass in strings of any length if you want to fill in one of those optional descriptive text fields, and the module will handle all formatting for you.
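The kind of information that can be deduced from the data is easy to picture with a small sketch. The code below is not how CatalogHeaderCreator is implemented; plain dictionaries stand in for data records, and summarize_records is a hypothetical helper that derives the overall time span and the range of one parameter.

```python
import datetime

# Illustrative only: plain dicts stand in for Madrigal data records, and
# summarize_records is a hypothetical helper, not part of madrigal.cedar.
records = [
    {'start': datetime.datetime(2005, 3, 19, 12, 30),
     'end': datetime.datetime(2005, 3, 19, 12, 31),
     'gdalt': [70.0, 100.0, 200.0]},
    {'start': datetime.datetime(2005, 3, 19, 12, 31),
     'end': datetime.datetime(2005, 3, 19, 12, 32),
     'gdalt': [70.0, 100.0, 400.0]},
]

def summarize_records(records, parm):
    """Deduce the overall time span and the range of one parameter."""
    values = [v for rec in records for v in rec[parm]]
    return {
        'start': min(rec['start'] for rec in records),
        'end': max(rec['end'] for rec in records),
        'min': min(values),
        'max': max(values),
    }

summary = summarize_records(records, 'gdalt')
print(summary['start'], summary['end'], summary['min'], summary['max'])
```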

Creating a new file example

"""createSample.py shows an example of creating an entirely new Madrigal
file using the Python cedar module.  In particular, it creates a file with
a catalog record, a header record, and two data records.  The data records
contain two 1D parameters (System temperature - SYSTMP and Transmitter
Frequency - TFREQ) and five 2D parameters (GDALT, GDLAT, GLON, TR, and DTR).
The independent spatial parameter in this example is just GDALT.

This example uses the normal-sized pattern for speed, where normal sized is considered 
less than 3000 records.  In that pattern, all the records are appended to the MadrigalCedarFile 
object, and then write is called to write everything to file at once, including both the Table and 
Array Layouts.

$Id$
"""

import os, os.path
import types
import datetime

import madrigal.metadata
import madrigal.cedar

################# sample data #################

kinst = 30 # instrument identifier of Millstone Hill ISR
kindat = 3408 # id of kind of data processing
nrow = 5 # all data records have 5 2D rows

SYSTMP = (120.0, 122.0)
TFREQ = (4.4E8, 4.4E8)

GDALT = ((70.0, 100.0, 200.0, 300.0, 400.0),
         (70.0, 100.0, 200.0, 300.0, 400.0))

GDLAT = ((42.0, 42.0, 42.0, 42.0, 42.0),
         (42.0, 42.0, 42.0, 42.0, 42.0))

GLON  = ((270.0, 270.0, 270.0, 270.0, 270.0),
         (270.0, 270.0, 270.0, 270.0, 270.0))

TR    = (('missing', 1.0, 1.0, 2.3, 3.0),
         ('missing', 1.0, 1.7, 2.4, 3.1))


DTR   = (('missing', 'assumed', 'assumed', 0.3, 0.7),
         ('missing', 'assumed', 0.7, 0.4, 0.5))

################# end sample data #################

newFile = '/tmp/testCedar.hdf5'

# create a new Madrigal file 
cedarObj = madrigal.cedar.MadrigalCedarFile(newFile, True)

# create all data records -  each record lasts one minute
startTime = datetime.datetime(2005, 3, 19, 12, 30, 0, 0)
recTime = datetime.timedelta(0,60)
for recno in range(2):
    endTime = startTime + recTime
    dataRec = madrigal.cedar.MadrigalDataRecord(kinst,
                                                kindat,
                                                startTime.year,
                                                startTime.month,
                                                startTime.day,
                                                startTime.hour,
                                                startTime.minute,
                                                startTime.second,
                                                startTime.microsecond//10000, # centiseconds as an integer
                                                endTime.year,
                                                endTime.month,
                                                endTime.day,
                                                endTime.hour,
                                                endTime.minute,
                                                endTime.second,
                                                endTime.microsecond//10000, # centiseconds as an integer
                                                ('systmp', 'tfreq'),
                                                ('gdalt', 'gdlat', 'glon', 'tr', 'dtr'),
                                                nrow, ind2DList=['gdalt'])

    # set 1d values
    dataRec.set1D('systmp', SYSTMP[recno])
    dataRec.set1D('tfreq', TFREQ[recno])

    # set 2d values
    for n in range(nrow):
        dataRec.set2D('gdalt', n, GDALT[recno][n])
        dataRec.set2D('gdlat', n, GDLAT[recno][n])
        dataRec.set2D('glon',  n, GLON[recno][n])
        dataRec.set2D('tr',    n, TR[recno][n])
        dataRec.set2D('dtr',   n, DTR[recno][n])

    # append new data record
    cedarObj.append(dataRec)

    startTime += recTime                                 

# write new file
cedarObj.write()

# next, use the cedar.CatalogHeaderCreator class to add catalog and header 
catHeadObj = madrigal.cedar.CatalogHeaderCreator(newFile)
catHeadObj.createCatalog(principleInvestigator="John Holt", sciRemarks="Test data only - do not use")
catHeadObj.createHeader(analyst="Bill Rideout", comments="Do not use this data")
catHeadObj.write()

Creating a new large file example

"""createSampleLargeFile.py shows an example of creating an entirely new,
very large Madrigal file using the Python cedar module.  

In particular, it creates a file with
a catalog record, a header record, and one hundred thousand data records.  Because this
is a large file, it calls dump for every five hundred records, which writes all the data
to file in the Table Layout.  This limits the memory footprint to just over 1GB.
This example closes with a call to close(), which triggers the 
creation of Array Layout, which again keeps the memory footprint down by only reading
in a set number of records at a time.

The data records
contain two 1D parameters (System temperature - SYSTMP and Transmitter
Frequency - TFREQ) and five 2D parameters (GDALT, GDLAT, GLON, TR, and DTR).
The independent spatial parameter in this example is just GDALT.

$Id$
"""

import os, os.path
import types
import datetime

import madrigal.metadata
import madrigal.cedar

################# sample data #################

kinst = 30 # instrument identifier of Millstone Hill ISR
kindat = 3408 # id of kind of data processing
nrow = 5 # all data records have 5 2D rows

SYSTMP = (120.0, 122.0)
TFREQ = (4.4E8, 4.4E8)

GDALT = ((70.0, 100.0, 200.0, 300.0, 400.0),
         (70.0, 100.0, 200.0, 300.0, 400.0))

GDLAT = ((42.0, 42.0, 42.0, 42.0, 42.0),
         (42.0, 42.0, 42.0, 42.0, 42.0))

GLON  = ((270.0, 270.0, 270.0, 270.0, 270.0),
         (270.0, 270.0, 270.0, 270.0, 270.0))

TR    = (('missing', 1.0, 1.0, 2.3, 3.0),
         ('missing', 1.0, 1.7, 2.4, 3.1))


DTR   = (('missing', 'assumed', 'assumed', 0.3, 0.7),
         ('missing', 'assumed', 0.7, 0.4, 0.5))

################# end sample data #################

newFile = '/tmp/testCedar.hdf5'
try:
    os.remove(newFile)
except OSError:
    # nothing to remove if the file does not exist yet
    pass

# create a new Madrigal file 
cedarObj = madrigal.cedar.MadrigalCedarFile(newFile, True)

# create all data records -  each record lasts four seconds
startTime = datetime.datetime(2005, 3, 19, 12, 30, 0, 0)
recTime = datetime.timedelta(0,4)
for recno in range(100000):
    endTime = startTime + recTime
    dataRec = madrigal.cedar.MadrigalDataRecord(kinst,
                                                kindat,
                                                startTime.year,
                                                startTime.month,
                                                startTime.day,
                                                startTime.hour,
                                                startTime.minute,
                                                startTime.second,
                                                startTime.microsecond//10000, # centiseconds as an integer
                                                endTime.year,
                                                endTime.month,
                                                endTime.day,
                                                endTime.hour,
                                                endTime.minute,
                                                endTime.second,
                                                endTime.microsecond//10000, # centiseconds as an integer
                                                ('systmp', 'tfreq'),
                                                ('gdalt', 'gdlat', 'glon', 'tr', 'dtr'),
                                                nrow, ind2DList=['gdalt'])

    # set 1d values
    dataRec.set1D('systmp', SYSTMP[recno % 2])
    dataRec.set1D('tfreq', TFREQ[recno % 2])

    # set 2d values
    for n in range(nrow):
        dataRec.set2D('gdalt', n, GDALT[recno % 2][n])
        dataRec.set2D('gdlat', n, GDLAT[recno % 2][n])
        dataRec.set2D('glon',  n, GLON[recno % 2][n])
        dataRec.set2D('tr',    n, TR[recno % 2][n])
        dataRec.set2D('dtr',   n, DTR[recno % 2][n])

    # append new data record
    cedarObj.append(dataRec)

    startTime += recTime     
    
    if recno % 500 == 0 and recno > 0:
        cedarObj.dump() # this puts everything on disk and removes it from RAM
    if recno % 10000 == 0:
        # give some feedback
        print('At %i records' % (recno))
        
# finished adding new records
cedarObj.dump() # write whatever records are still in RAM
print('about to call close, which will also create the array layout') 
cedarObj.close() # this triggers creation of the Array Layout, then closes file

print('Finally, using CatalogHeaderCreator to add some descriptive text.')
# next, use the cedar.CatalogHeaderCreator class to add catalog and header 
catHeadObj = madrigal.cedar.CatalogHeaderCreator(newFile)
catHeadObj.createCatalog(principleInvestigator="John Holt", sciRemarks="Test data only - do not use")
catHeadObj.createHeader(analyst="Bill Rideout", comments="Do not use this data")
catHeadObj.write()
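The memory behavior of the append-and-dump loop above can be checked with a toy model. BufferedWriter below is a hypothetical stand-in, not a Madrigal class: it only counts how many records are held in memory and how many have been flushed. The loop uses the same flush condition as the example.

```python
class BufferedWriter:
    """Toy stand-in (hypothetical) for a file object that holds appended
    records in memory until dump() is called."""

    def __init__(self):
        self.buffer = []
        self.written = 0   # records flushed so far
        self.peak = 0      # largest number of records held at once

    def append(self, rec):
        self.buffer.append(rec)
        self.peak = max(self.peak, len(self.buffer))

    def dump(self):
        self.written += len(self.buffer)
        self.buffer = []

writer = BufferedWriter()
for recno in range(100000):
    writer.append(recno)
    # same flush condition as the example above
    if recno % 500 == 0 and recno > 0:
        writer.dump()
writer.dump()  # flush whatever is still buffered
print(writer.written, writer.peak)
```

Under this model every record is written exactly once, and the buffer never holds more than 501 records: the first flush happens at record number 500, after records 0 through 500 have been appended.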

Editing an existing file example

"""editSample.py shows an example of editing existing data in a Madrigal
file using the Python cedar module.  In particular, it edits the sample file
mlh980120g.002.hdf5 to increase all Ti values by a factor of 1.2
"""

import os, os.path
import types

import madrigal.metadata
import madrigal.cedar

metaObj = madrigal.metadata.MadrigalDB()

orgFile = os.path.join(metaObj.getMadroot(), 'experiments/1998/mlh/20jan98/mlh980120g.002.hdf5')
newFile = '/tmp/mlh980120g.002.hdf5'

# read the Madrigal file into memory
cedarObj = madrigal.cedar.MadrigalCedarFile(orgFile)

# loop through each record, increasing all Ti values by a factor of 1.2
for record in cedarObj:
    # skip header and catalog records
    if record.getType() == 'data':
        # loop through each 2D row
        for row in range(record.getNrow()):
            presentTi = record.get2D('Ti', row)
            # make sure it's not a special string value, e.g. 'missing'
            if not isinstance(presentTi, (str, bytes)):
                record.set2D('Ti', row, presentTi*1.2)

# write edited file
cedarObj.write(newFilename=newFile)
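The type check in the loop above exists because a 2D value may be a special string such as 'missing' or 'assumed' rather than a number. A minimal sketch of that guard as a reusable function follows; scale_if_numeric is a hypothetical helper name, not part of the Madrigal API, and the sample values are made up.

```python
def scale_if_numeric(value, factor):
    """Return value * factor for numbers; pass special string values
    such as 'missing' or 'assumed' through unchanged."""
    if isinstance(value, (str, bytes)):
        return value
    return value * factor

# Made-up sample values mixing numbers and special strings.
values = ['missing', 1000.0, 'assumed', 1250.0]
scaled = [scale_if_numeric(v, 1.2) for v in values]
print(scaled)
```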

Matlab

To create CEDAR Madrigal Hdf5 files with Matlab, you will need to put the file MadrigalHdf5File.m on your Matlab path. You also need to run the Matlab script on a server with Madrigal installed, since ultimately Matlab will simply write your information to a *.mat file, and then call the python script createMadrigalHdf5FromMatlab.py to convert that *.mat file to the output Hdf5 file. The script createMadrigalHdf5FromMatlab.py is installed in MADROOT/bin as part of the Madrigal installation.

You may, however, run MadrigalHdf5File.m on a non-Madrigal server and save the output file as a *.mat file rather than an Hdf5 file. You could then convert that *.mat file into a Madrigal file on a server with Madrigal installed, using the convertToMadrigal function also defined in MadrigalHdf5File.m.

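There are only seven methods in the main class of this API (the constructor MadrigalHdf5File, appendRecord, set1DParm, set2DParm, setCatalog, setHeader, and write), along with one helper method, convertToMadrigal, needed only if not running on a Madrigal server.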

These seven methods are documented below, along with the helper method convertToMadrigal. An example Matlab script that creates a CEDAR Madrigal Hdf5 file is also given.

classdef MadrigalHdf5File
    % class MadrigalHdf5File allows the creation of Madrigal Hdf5 files via
    % Matlab.  The general idea of this class is to simply write out all
    % data for the file in a Matlab struct array.  Then the python script
    % createMadrigalHdf5FromMatlab is called from this script to create all 
    % metadata and alternate array layouts.  This keeps the amount of Matlab 
    % code here at a minimum.  See file testMadrigalHdf5File.m for example
    % usage.

   methods
        function madFile = MadrigalHdf5File(filename, oneDParms, ...
                independent2DParms, twoDParms, arraySplittingParms)
            % Object constructor for MadrigalHdf5File
            % Inputs:
            %   filename - the filename to write to.  Must end *.hdf5, .h5,
            %     .hdf, or .mat.  If .mat, writes a Matlab file that must
            %     later be converted to Madrigal using convertToMadrigal
            %   oneDParms - a cell array of strings representing 1D parms.  May be
            %     empty.  Example:
            %     cellstr(char('azm', 'elm', 'sn', 'beamid'))
            %   independent2DParms - a cell array of strings representing independent 
            %     2D parms.  May be empty (ie, {}). Examples:
            %       cellstr(char('range'))
            %       cellstr(char('gdlat', 'glon'))
            %       cellstr(char())
            %   twoDParms - a cell array of strings representing dependent 
            %     2D parms.  May be empty (ie, {}). Examples:
            %       cellstr(char('ti', 'dti', 'ne', 'dne'))
            %       cellstr(char())
            %   arraySplittingParms - a cell array of strings representing  
            %     parameters whose values are used to split arrays.  May 
            %     be empty, in which case set to {}. Example:
            %       cellstr(char('beamid'))
            %  skipArray - optional argument.  If set to true, no array
            %      layout created.  If false or not passed in, array layout
            %     created if any 2D variables.

     function madFile = appendRecord(madFile, ut1_unix, ut2_unix, kindat, ...
                kinst, numRows)
            % appendRecord adds a new record to MadrigalHdf5File.  It
            % returns the record number of the present row (first will be
            % 0)
            %  Inputs:
            %    madFile - the created MadrigalHdf5File object
            %    ut1_unix, ut2_unix - unix start and end time of record in
            %      float seconds since 1970-01-01
            %    kindat - integer kind of data code. See metadata.
            %    kinst - integer instrument code.  See metadata.
            %    numRows - number of rows of 2D data.  If all 1D data, set
            %    to 1
            %  Returns:
            %    the record number of the present row (first will be 0)
            %  Affects:
            %    Updates madFile.recordCount, appends to madFile.data the
            %    number of rows numRows with all data except stdParms set
            %    to NaN.  Use set1DParm and set2DParm to populate that
            %    record using recNum as index.

    function madFile = set1DParm(madFile, parm, value, lastRec)
            % set1DParm sets the values of 1D parm parm to value value for
            % record with lastRecord value lastRec

    function madFile = set2DParm(madFile, parm, values, lastRec)
            % set2DParm sets the values of 2D parm parm to value values for
            % record with lastRecord value lastRec

    function madFile = setCatalog(madFile, principleInvestigator, expPurpose, expMode, ...
                                      cycleTime, correlativeExp, sciRemarks, instRemarks)
            % setCatalog allows setting extra information in the catalog
            % record.  This method is optional.  Even if this method is not
            % called, the catalog record will contain a description of the
            % instrument (kinst code and name), a brief description of the
            % kind of data, a list describing the parameters in the file,
            % and the first and last times of the measurements.
            %
            % Inputs:
            %
            %    principleInvestigator - Names of responsible Principal Investigator(s) or 
            %       others knowledgeable about the experiment.
            %    expPurpose - Brief description of the experiment purpose
            %    expMode - Further elaboration of meaning of MODEXP; e.g. antenna patterns 
            %       and pulse sequences.
            %    cycleTime - Minutes for one full measurement cycle - must
            %       be numeric
            %    correlativeExp - Correlative experiments (experiments with related data)
            %    sciRemarks - scientific remarks
            %    instRemarks - instrument remarks
            %

    function madFile = setHeader(madFile, kindatDesc, analyst, comments, history)
            % setHeader allows setting extra information in the header
            % record.  This method is optional.
            %
            % Inputs:
            %
            %    kindatDesc - description of how this data was analyzed (the kind of data)
            %    analyst - name of person who analyzed this data
            %    comments - additional comments about data (describe any instrument-specific parameters)
            %    history - a description of the history of the processing of this file
            %

    function write(madFile)
            % write writes out the complete Hdf5 file to madFile.filename
    
function convertToMadrigal(matFile, madrigalFile)
    % convertToMadrigal converts a matlab mat file to Madrigal Hdf5 file
    %   Inputs:
    %      matFile - existing Matlab .mat file created earlier
    %      madrigalFile - madrigal file to create.  Must end *.hdf5, .h5,
    %     .hdf, or .mat.

% test/example script to exercise MadrigalHdf5File
%
% $Id: testMadrigalHdf5File.m 4644 2015-01-13 19:30:37Z brideout $
filename = '/Users/brideout/Documents/workspace/mad3_0/madroot/source/madmatlab/example.h5';
oneDParms = cellstr(char('azm', 'elm', 'sn', 'beamid'));
independent2DParms = cellstr(char('range'));
twoDParms = cellstr(char('ti', 'dti'));
% Use {} for independent2DParms and twoDParms if all scalar parameters
arraySplittingParms = cellstr(char('beamid'));
% Use arraySplittingParms = {}; for no splitting

% some hard-coded fake data
ut1_unix = 1.0E9;
ut2_unix = 1.0E9 + 100;
kindat = 3410;
kinst= 30;
numRows = 5;
azm = 100.0;
elm = 45.0;
sn = 0.5;
range = [100.0, 150.0, 200.0, 250.0, 300.0];
ti = [1000.0, 1100.0, 1200.0, 1300.0, 1400.0];
dti = [100.0, 150.0, 200.0, 250.0, 300.0];
beamids = [1,2,1,2,1,2,1,2,1,2];


madFile = MadrigalHdf5File(filename, oneDParms, ...
          independent2DParms, twoDParms, arraySplittingParms);
      
for rec = 1:10
    madFile = madFile.appendRecord(ut1_unix, ut2_unix, kindat, ...
        kinst, numRows);

    % set 1D values
    madFile = madFile.set1DParm('azm', azm, madFile.lastRecord);
    madFile = madFile.set1DParm('elm', elm, madFile.lastRecord);
    madFile = madFile.set1DParm('sn', sn, madFile.lastRecord);
    madFile = madFile.set1DParm('beamid', beamids(rec), madFile.lastRecord);
    
    % set 2D and independent variables
    madFile = madFile.set2DParm('range', range, madFile.lastRecord);
    madFile = madFile.set2DParm('ti', ti, madFile.lastRecord);
    madFile = madFile.set2DParm('dti', dti, madFile.lastRecord);
    
end

% add catalog and header info
principleInvestigator = 'Bill Rideout';
expPurpose = 'Measure the ionosphere';
expMode = 'Vector measurements';
cycleTime = 20.0;
correlativeExp = ''; 
sciRemarks = 'Big solar storm'; 
instRemarks = 'Working well!!!';
madFile = madFile.setCatalog(principleInvestigator, expPurpose, expMode, ...
                            cycleTime, correlativeExp, sciRemarks, instRemarks);

kindatDesc = 'Regular processing';
analyst = 'Phil Erickson'; 
comments = 'Include unusual parameter description here';
history = 'Reprocessed four times';
madFile = madFile.setHeader(kindatDesc, analyst, comments, history);

write(madFile);
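The ut1_unix and ut2_unix arguments used in the script above are float seconds since 1970-01-01. When the record times start out as calendar dates, they can be converted beforehand; the sketch below does so in Python (the language Madrigal itself uses for the final conversion) with only the standard library. The helper name to_unix_seconds is an assumption for illustration. The hard-coded value 1.0E9 corresponds to 2001-09-09 01:46:40 UTC.

```python
import datetime

def to_unix_seconds(dt_utc):
    """Convert a timezone-aware datetime to float seconds since
    1970-01-01 00:00:00 UTC."""
    return dt_utc.timestamp()

# Start time of the record, given explicitly in UTC.
start = datetime.datetime(2001, 9, 9, 1, 46, 40,
                          tzinfo=datetime.timezone.utc)
ut1_unix = to_unix_seconds(start)
ut2_unix = ut1_unix + 100  # record lasts 100 seconds, as in the script above
print(ut1_unix, ut2_unix)
```

Passing a timezone-aware datetime avoids any dependence on the local time zone of the machine doing the conversion.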