The Process Configuration Manager (PCM) refers to two interfaces through which users describe processes that produce data products (the PCM Process GUI) and the data products that are produced. The descriptions of the output products are referred to as Data Object Designs (DODs), thus the second interface is referred to as the PCM DOD GUI. Because the PCM is intended to expedite the transfer of a scientific algorithm into the ARM production processing system, it is expected that the desired output product is well defined. How to define a process, and how to subsequently run that process within a compilable project and with the data_consolidator application, is described in the following sections. Flow charts diagramming the key development stages are available in the ADI Development Steps section presented at the end.
The Process Configuration Manager (PCM) is the master interface from which a user can access the Process Definition and other tools used to view, edit, and define an ADI application’s process, input, and output configurations.
Currently loaded interfaces are displayed on the right panel, while access to ARM’s datastreams and processes is maintained on the left panel via the Processes and Datastreams tabs.
A user can have multiple tools open and be simultaneously viewing or editing datastreams, processes, and DODs.
Each instance of an active tool is maintained in a set of tabs located along the top of the PCM’s right panel, as shown in the following figure, The Process Configuration Manager (PCM). By default, the PCM will enable the Datastream tab of the left panel and display the Intro tab on the right panel.
Note in the left frame, below the Filter the List heading, that the Production and Development tabs are grayed out. This indicates that the list of datastreams displayed is a combined list of both production and development datastreams. To filter out the development datastreams, select the Production button. To filter datastreams by a string, enter the string into the blank cell below the Production button.
The Process Configuration Manager (PCM)
From the PCM:Datastreams panel a user can view existing datastreams, their associated DODs, create new datastreams, and define DODs for a datastream. From the PCM:Processes panel a user can view an existing process, or create a new process.
Defining an ARM process consists of defining its inputs and outputs, and documenting where it will run through the Process Definition Tool component of the ADI PCM.
A Process Definition includes definitions of inputs, outputs, and operating parameters that relate to the process as a whole, as opposed to a specific input, output, or transformation applied to an input. A summary of the information needed to define an ADI process, and information helpful in completing the process definition, is presented below.
required
Name of the process. The executable name will be the process name followed by ‘_vap’.
Locations and Other Options Form
required
Each ARM site/facility pairing for which the process is valid to run. The process can only run at facilities that are documented in the DSDB.
required
Email address to receive error and warning messages produced by the process.
required
The number of seconds of data processed in each iteration through the ADI modules that follow initialization and precede finishing (retrieve, merge, transform, create output datasets, store data). Defaults to a single day (86400 seconds). If set to 0, the size of the chunk of data processed through the retrieve, merge, and process modules equals the span between the begin and end dates, plus any time offsets.
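The chunking behavior described above can be sketched as follows. This is a hypothetical illustration, not ADI code; the function and variable names are ours.

```python
# Hypothetical sketch of how a begin..end span (epoch seconds) is split into
# processing chunks. interval_seconds = 0 means "process the whole span at once".
def chunk_boundaries(begin, end, interval_seconds=86400):
    """Yield (chunk_begin, chunk_end) pairs covering begin..end."""
    if interval_seconds == 0:
        yield (begin, end)
        return
    t = begin
    while t < end:
        yield (t, min(t + interval_seconds, end))
        t += interval_seconds

# Two days processed with the default one-day interval -> two chunks.
chunks = list(chunk_boundaries(0, 172800))
```

With the default of 86400 seconds, a two-day run is processed as two one-day chunks; with an interval of 0, the entire span is processed in a single pass.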
Fig: Input/Outputs Form
required
If the process will be retrieving data from netCDF data files, select ADI. Basic type is for non-netCDF input files. Load input netCDF files into the DOD as needed as described in Creating a New DOD by Importing from a NetCDF File.
required
Input datastreams for an ADI type process are specified as part of the retrieval definition process. Select the Edit Retrieval button and complete the Retrieval Definition Form as described in `Defining a Datastream Class`_.
required
Select an existing datastream class and data level from the drop-down list of available values, or enter a new datastream base platform name and data level that conforms to ARM naming standards. If a DOD for that datastream is in the DSDB, an expandable reference, which is also a link to the DOD interface page, is displayed.
To define a new process, perform the following steps.
In the example shown below, the VAP process created is ‘example_vap’; it runs at the sgp C1, sgp B1, and nsa C1 facilities and produces the output datastream examplevap.c1. If the examplevap.c1 datastream has not been previously defined, saving the process information to the DSDB will add examplevap.c1 to the list of datastreams available from the PCM:Datastreams view. Note the Process Definition form is labeled as an ‘example_vap Process Form’ tab at the top of the right panel, next to the ‘Intro’ tab.
To rename an existing process perform steps one and two from Defining a New Process, edit the name of the process and save the change.
It is not possible to duplicate a process directly. However, the Text Export/Import button displayed in the lower right hand corner of the Retriever Editor form can be used to copy all the Retriever Table database entries into another process. To fully duplicate the process, the attributes associated with the process will need to be reentered into the new process’s Locations and Other Options and Inputs/Outputs forms.
At the most basic level, defining the inputs to a VAP consists of documenting the name of a variable and the datastream from which it should be retrieved. Historically, a significant effort was expended performing pre-analysis data consolidation and transformations to prepare input data for scientific analysis. To minimize, if not eliminate, the need for VAP developers to perform such tasks, ADI allows a user to specify these consolidation steps, including basic unit, data type, and coordinate system transformations, as part of the retrieval definition.
Additional control is also provided to define input data source preference by site/facility pairing or time range dependencies.
The inputs of all VAPs must be specified using a retriever process. As such, by default the ‘This process uses a retrieval for its input configuration’ box will be checked and an ‘Edit Retrieval’ button should be evident in the lower left side of the right panel. Selecting this button will bring up the Retrieval Definition form, as shown in the following example screen capture. Note that the Retrieval Definition form has replaced the Process Definition form but it is still organized under the ‘example_vap VAP Process’ tab.
To return to the main Process Definition form select the ‘Done’ button at the bottom left of the right panel.
Note
Retriever data will not be stored in the DSDB until the ‘Save’ button in the Process Definition form has been selected.
The Retrieval Definition form allows the user to not only specify the variable and data source from which to retrieve a variable, but also to perform some basic transformations of units and data type. These options are checkboxes in the bar above the table. Selecting an item adds the data entry column to the table. In addition to the transformations, the bar also allows the user to retrieve data for a particular variable for some extra time before and/or after the process period specified in the command line, and to automatically retrieve the companion QC variable. A description of each of the columns in the Retrieval Definition table is given below.
| ADI Retrieval Definition Form Parameters | ||
|---|---|---|
| Process Element | Required | Description and Comments |
| Source(s) | Yes | Datastream Source(s) is the datastream(s) from which the value(s) for the variable should be retrieved. Populated via the Data Sources Definition form. A single value can be retrieved from a prioritized list of preferred and alternate datastreams. (In Figure 3.2, the first_cbh variable is retrieved from either the vceil25k.a1 or vceil25k.b1 datastream based on the indicated conditions and correlated to a user defined variable ‘cloud_base_height’.) |
| Variable Name | Yes | The Variable Name consists of the user defined name of the variable to be retrieved, and an indication of whether finding the variable in one of the specified input data sources is a requirement that must be met for the VAP to successfully run. Variable names in the ‘Variable Name’ column must be unique. If the ‘Required’ check box is marked, the VAP process will fail to run for a given observation (i.e., input data file) unless the specified variable is successfully retrieved, and an asterisk will follow the Variable Name. This will be the name by which the retrieved data is referred to in the DSDB and in auto-generated code. It is not necessary for this name to match the name in the datastream(s) from which the variable is retrieved. Coordinate dimension variables (i.e., time, height, range, etc.) should not be included in the Retrieval Definition table, as all coordinate dimensions of retrieved variables are automatically retrieved. This automatic retrieval is only successful when the dimension name and variable name in the input datastream file are identical. |
| Coord System | No | Coord System is the name assigned by the developer to the coordinate system for a given variable. The parameters associated with a coordinate system are assigned via the Coordinate System Definition Form. A transformation method must be defined for each dimension of a variable’s coordinate system. ADI supports two methods of assigning a coordinate system to a given dimension: (1) assigning a uniform system (i.e., a coordinate system characterized by a constant interval between all samples of the dimension), or (2) a mapping (a coordinate system not explicitly defined, but indicated by selecting a coordinate variable from another datastream to which a retrieved variable’s dimension will be transformed). These are more fully discussed in Coordinate System Definition Form Overview. It is recommended that all retrieved variables passed through to an output datastream, even when the input and output coordinate systems are identical, have an explicit name and are defined using a mapping or static values. For cases where the output coordinate system is the same as that of the input datastream, it should be defined as a mapping onto itself. This will fill gaps in data to create a more complete file. |
| Outputs | Yes | The name of the output datastream(s) and level(s) that a retrieved variable will be propagated to as part of the data consolidation process, and the name of the variable as it will be found in the output datastream(s). Populated via the Output Field Mapping Form. The output datastream(s) are prepopulated with all possible output datastreams documented in the Inputs / Outputs section of the Process Definition Form. For a retrieved variable to exist in an output datastream, the name must be entered into the empty cell adjacent to the datastream name and level in the Output Field Mapping Form. |
| Units | No | Specifies the units into which the retrieved data will be converted. Units are converted using Unidata’s UDUNITS library. The DEFAULT value results in the units staying the same as found in the input file from which the variable is retrieved. Units are entered free form. Please reference Unidata’s web page for further information: http://www.unidata.ucar.edu/software/udunits/udunits-2/udunits2.html. |
| Data Type | No | A drop list of possible data types into which the retrieved data can be converted. If no value is provided, the data type will default to type float. If the data type remains as a default value through the population of the Data Sources Definition form, and a field is selected from the drop list of available values, the data type will be updated to the type of the selected field as found in the specified datastream. If the default value is overridden in the Retrieval Definition table, the data type will not be updated as a result of field selections in the Data Sources Definition form. |
| QC | Yes | Indicates whether the companion QC variable will be retrieved in addition to the variable noted in the Variable Name, and whether successfully finding the companion QC variable is a requirement for the VAP to run. It is assumed that the companion QC variable name will be equal to the name of the variable in the input datastream file preceded by ‘qc_’. If the ‘Required’ check box is marked, the VAP process will fail to run for a given observation (i.e., input data file) unless both the variable and its QC variable are successfully retrieved. |
| Offsets (Seconds) | No | Allows a user to retrieve additional data for each processing interval, either before the interval or after. This includes data from before the begin date, or after the end date, entered at the command line at run time. The begin_date and end_date values are for the “current processing interval” and are not adjusted by the offsets; all records with times before begin_date or after end_date are records within the specified offsets (for normal daily processing these would be from the previous day or the next day). Note that begin_date and end_date are input parameters to all user hooks. If the input and output bins do not both line up with the processing interval boundaries, to be absolutely sure you get all the input data you need outside the edge of a processing interval, define offsets of [size of input bin] + [size of transformed bin]. This will retrieve enough data to cover the worst case of diametrically opposed alignments (alignment of 0.0 in one, and 1.0 in the other). For example, with a sample interval of 60 seconds and offsets of 60 seconds at both the start and the end, the retrieved samples will span indices 0 to 1441, but the output files created will still be 1440 samples in size and consist of samples 1 to 1440. Offsets are typically used to provide a buffer of data for a type of analysis that needs to see a larger window of data than the processing interval of the ADI process. Although the processing spans the entire period, the output file only covers the processing interval. |
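The worst-case offset rule from the Offsets row above can be sketched as a small calculation. This is a hypothetical illustration of the stated rule, not ADI code.

```python
# Hypothetical sketch: when neither the input bins nor the transformed (output)
# bins align with the processing interval boundaries, an offset of
# [size of input bin] + [size of transformed bin] seconds at each end
# guarantees all contributing input samples are retrieved.
def worst_case_offset(input_bin_seconds, transformed_bin_seconds):
    return input_bin_seconds + transformed_bin_seconds

# 60 s input bins transformed onto 300 s output bins:
offset = worst_case_offset(60, 300)  # 360 seconds at each end of the interval
```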
The Data Sources Definition form allows a user to define the source(s) of the data to retrieve and assign to the user defined variable. It allows for lists of preferred and alternate data sources, multiple possible variable names, and location and time dependencies. A description of each of the columns in the Data Sources Definition form is given below.
| ADI Data Sources Definition Form Parameters | ||
|---|---|---|
| Process Element | Required | Description and Comments |
| Priority | No | Integer representation of priority when alternative Datastream Sources are specified. When priority is not populated, the first row is the highest priority and the last is the lowest. Dragging and dropping the rows into the desired order is another way to adjust priority. |
| Datastream Class | Yes | Datastream from which the variable with the name noted in the ‘Field(s)’ column will be retrieved. Must be populated first before any of the other elements in the Data Sources Definition form can be populated. |
| Field(s) | Yes | Name of the variable to retrieve as found in the datastream defined as the Datastream Class. Initially populated with a default value equal to the user defined Variable Name from the Retrieval Definition form; this default value is noted by brackets. If the datastream is loaded into the history database, clicking on the Field(s) cell will bring up a drop list populated with all possible variable names; if not, the user should enter the desired variable name followed by a <return>. The variable names shown in the drop list reflect all the variables that have existed for that datastream over all time, not just the variables in the datastream’s latest DOD. If more than one variable is entered into the Field(s) column, the retriever searches the input datastream file for each of the variables in the order listed, until one is found. |
| Location Dependency | No | Used when the datastream from which to retrieve data is a function of the site/facility at which the VAP is being run. |
| Time Dependency | No | Used when the datastream from which to retrieve data is a function of what period the VAP process is running. If a begin or end time dependency is not selected, the time dependency defaults to the beginning of the datastream or end of the datastream respectively. |
An example of both a location and time dependency is illustrated in the preceding figure. In this example, when the VAP is run for sgpB4, the user defined variable ‘cloud_base_height’ will be correlated to the first_cbh variable in the vceil25k.a1 datastream. If it is not running at sgpB4 and the date being processed falls before April 1, 2001, the user defined variable ‘cloud_base_height’ will be correlated to the variable ‘first_cbh’ in the vceil25k.a1 datastream. For process times April 1, 2001 or greater, and when processing at sites other than sgpB4, the user defined variable ‘cloud_base_height’ will be correlated to the first_cbh variable in the vceil25k.b1 datastream.
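The location and time dependency logic described above can be sketched as a simple lookup. This is a hypothetical illustration of the example's resolution rules (the function name is ours); ADI performs this selection internally.

```python
from datetime import date

# Hypothetical sketch of how the 'cloud_base_height' source is resolved
# for the example above: a location dependency for sgpB4, and a time
# dependency (before April 1, 2001) everywhere else.
def resolve_source(site_facility, processing_date):
    if site_facility == "sgpB4":
        return "vceil25k.a1"              # location dependency
    if processing_date < date(2001, 4, 1):
        return "vceil25k.a1"              # time dependency
    return "vceil25k.b1"                  # default source

# At sgpB4 the .a1 datastream is always used, regardless of date.
src = resolve_source("sgpB4", date(2005, 6, 1))
```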
This form is accessed by double clicking a cell in the Retriever Editor Output(s) column. It consists of a row for each of the possible output datastreams, with a drop box containing all the variables in that output DOD. To associate a variable in the Retriever Editor with a specific output variable, simply select the desired variable from the drop box next to the datastream. This will result in the values associated with the retrieved variable being mapped to the selected variable in the output datastream.
In most cases, a new coordinate system can be fully defined via the Coordinate System Definition form. To transform a variable to a new coordinate system means to define new values for one or more of the variable’s dimensions, and update the variable’s values to reflect the new ‘grid’. The coordinate system of the retrieved variable will be referred to as the ‘source’; the coordinate system of the new grid will be referred to as the ‘target’. The form supports three transformation types: (1) averaging, (2) interpolation, and (3) nearest subsample. The parameters that can be specified via the form are documented in the following table for each transformation type.
| General Coordinate System Definition Form Parameters | ||
|---|---|---|
| Process Element | Required | Description and Comments |
| Variable(x,y, ...n) where x,y, ...n represent the dimensions that make up target coordinate system. | Yes | Name of each dimension for the retrieved variable. The order of the dimensions must match the order of the dimensions of the retrieved variable. If the name of dimension is to be changed, the new name should be entered. |
| Coordinate system name | Yes | The name of the coordinate system as stored in the CDSTrans structure and named in ADI templater generated header files. If no transformation is performed on a retrieved variable’s dimensions, then a CDSin structure is used to store the information and a coordinate system name is not needed. |
| Units | No | If set, the dimension will be converted to the indicated units prior to the transformation. |
| Data type | No | If set, the data type will be converted to the type indicated prior to the transformation. |
| Use mapping | No | Control button. If selected, it updates the form to display a table from which the user can select the datastream’s grid, onto which the indicated dimension will be mapped. If not selected, drop boxes and cells necessary to define a uniform grid are displayed. |
| Uniform Grid Coordinate System Definition Form Param | ||
|---|---|---|
| Process Element | Required | Description and Comments |
| Transform type | No | Allows the user to select the type of transform applied (average, interpolation, subsample, etc.). By default, if the output bins are larger than the input bins the data is averaged; if the output bins are smaller the data is interpolated; if the bin sizes are the same, no transformation is applied. |
| Bin alignment | No | Specifies where within the bin the coordinate variable value for the dimension is located: beginning, middle, or end. Default value is middle. |
| Interval | Yes * | Specifies the difference between two values of the given coordinate variable to generate a uniform grid. |
| Start | Yes * | The value of the coordinate dimension for the first element in the output grid. |
| End | Yes * | The value of the coordinate dimension for the last element in the output grid. |
| Length | Yes * | The number of bins, or distinct values for the coordinate dimension. For the dimension time this equals the number of samples in the file. |
For the interval, start, end, and length parameters, the user sets three of the four and the last is calculated and automatically set.
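The relationship among the four parameters can be sketched as follows. This is a hypothetical illustration of the calculation (the function name is ours), assuming the form enforces end = start + interval * (length - 1).

```python
# Hypothetical sketch: set any three of start, end, interval, and length,
# and the fourth follows from end = start + interval * (length - 1).
def solve_grid(start=None, end=None, interval=None, length=None):
    if length is None:
        length = int((end - start) / interval) + 1
    elif interval is None:
        interval = (end - start) / (length - 1)
    elif end is None:
        end = start + interval * (length - 1)
    elif start is None:
        start = end - interval * (length - 1)
    return start, end, interval, length

# A daily 30-second time grid: start 0, interval 30, end 86370 -> 2880 values.
grid = solve_grid(start=0, end=86370, interval=30)
```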
| Mapped Grid Coordinate System Definition Form Parameters | ||
|---|---|---|
| Process Element | Required | Description and Comments |
| Datastream group | Yes | The datastream to map to is determined by the user entering the name of the datastream group for which the target datastream is the highest priority datastream. |
In addition to the parameters provided in the form, additional parameters can be defined in a configuration file to further refine the transformation. Each of the transformation types, and the flat file that can be used to define them are discussed in detail in Transforming or Regridding Retrieved Variables onto a New Coordinate System.
The entries on the Coordinate System Definition form support the two most common types of transformations, averaging and interpolation. Through this form, the target grid can be defined in one of two ways: (1) by explicitly specifying a uniform grid via its interval, start, end, and length, or (2) by mapping onto the grid of a coordinate variable from another datastream.
The former is referred to as a uniform transformation, the latter, a mapped transformation. Unless the transform type is explicitly defined in the transform configuration file, the libraries determine whether an averaging or interpolation transformation is needed. If the target grid bins are larger than the source grid bins, the data will be averaged to match the new grid. If the target grid bin size is smaller, then interpolation will be applied. If either or both grids are irregular, then ADI will attempt to guess which default transformation should be used based on the average interval over the whole span of the grid.
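The default transform selection described above can be sketched as a simple comparison of bin sizes. This is a hypothetical illustration of the stated rule (the function name is ours), not the actual library logic, which also handles irregular grids.

```python
# Hypothetical sketch of the default transform choice: compare the target
# (output) bin size to the source (input) bin size.
def default_transform(source_bin, target_bin):
    if target_bin > source_bin:
        return "average"      # coarser target grid: average source samples
    if target_bin < source_bin:
        return "interpolate"  # finer target grid: interpolate between samples
    return "none"             # same bin size: no transformation applied
```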
The coordinate system in the figure above is an example of a uniform transformation of the time dimension. It has been assigned the name “thirty_second” and transforms the dimension time onto a uniform grid that starts at 0 seconds and grows in increments of 30 to 86370, with a total of 2880 values.
Retrieval Definition table variables and data sources can be populated by either: (1) manually entering rows into the table, or (2) dragging and dropping variables from the DODs of input datastreams.
Manual entry is more efficient where there is more than one datastream from which to retrieve the variable. When a variable’s source is a single datastream (no alternate sources if that datastream is unavailable), it is more efficient to access the DOD of the input datastream and drag and drop variables onto the Retrieval Definition table.
Note
Do not retrieve netCDF standard (lat, lon, alt) or coordinate dimension variables (time, height, range) for a retrieved variable, as these will be automatically retrieved.
Select the green plus symbol located to the left of the table form.
Select the ‘custom_field_1’ variable and enter the name of the variable to retrieve.
Indicate whether the variable must be found for the VAP to run via the ‘Required’ check box.
Select ‘Source(s)’ [NONE] in the Datastream column
Select the pencil icon to bring up the Data Sources Definition form.
Select the Datastream column corresponding to the Field with the value of the variable you just defined to bring up a drop list of possible datastreams.
If the data source is a single datastream with no alternative sources:
If the data source is a single datastream with alternative sources based on datastream availability:
If the data source is a single datastream with alternative sources based on location or time dependencies:
If the name of the variable found in the Datastream Class datastream does not match the default value, update the entry in the ‘Field(s)’ column and select the desired variable name.
If a second value is to be retrieved and correlated to the user defined Variable Name in the Retrieval Definition form (meaning the user defined variable in the Retrieval Definition form will be an array of more than one value) specify the data sources and associated values of the additional values as follows:
9. Close the Data Sources Definition window by selecting the ‘x’ in the upper right corner of the window. Return to the Retrieval Definition form and specify additional variables to retrieve by adding new rows to the Retrieval Definition table.
You can populate the Retrieval Definition table by dragging and dropping from input datastream DODs. For example, expand a datastream to view its variables, then drag and drop the desired variable into the Retrieval Definition table. The Source(s), Variable Name, and QC retrieval status will be populated. Update these as required.
For the example VAP we will build in this tutorial, we will retrieve first_cbh, qc_first_cbh, and backscatter variables from the vceil25k.b1 datastream. The first_cbh will be saved into a user defined variable name of ‘cloud_base_height’ and written to the output netCDF file with that name. If that datastream is unavailable the variables will be retrieved from the vceil25k.a1. The units of the first_cbh will be converted to centimeters and the successful retrieval of the QC variable will be required for the example_vap process to run. The Retrieval Definition table for example_vap, with the Data Sources Definition form open for the first_cbh variable, is shown in the following figure.
The duplicate entry icon (paper sheets icon) in the Retrieval Definition form was used to add the backscatter variable since it is retrieved using the same Data Sources Definition query (i.e., sets of possible input datastreams). The duplicate entry was updated as appropriate for the backscatter variable (update the Variable Name, Field(s), and Units). The coordinate dimensions of the retrieved variables (time and range) and lat, lon, and alt are not included in the Retrieval Definition table as they will be automatically retrieved. Note that the Location Dependency and Time Dependency check boxes in the Data Sources Definition have been deselected as they are not applicable to this example.
This section will have documentation on details of transformation.
The input data retrieval specifications are saved to the DSDB, from which they are accessed by the ADI templater application, create_adi_project, to create the project source code files used at run time by the VAP. Note in Figure 3.5 that the user defined variable names that are retrieved from the input datastreams are summarized above the ‘Edit Retrieval’ button.
To save retrieval data to the DSDB, perform the following steps.
To facilitate the storage, exploration, and retrieval of data associated with a datastream, all ARM datastreams must have an instrument code, and the datastream’s DOD must be stored in the ARM MetaData database. DODs can be submitted to the ARM MetaData database via the PCM DOD GUI through a link to the ARM MetaData Service (MST) submission form, located in the upper right corner and labeled ‘Submit for Review’ in green text. If the DOD has already been submitted and approved, the link is labeled ‘View Review [APPROVED]’. If submitted but not approved, the link reads ‘Under Review’. For VAPs that are to be released to the ARM production processing system, the responsible developer should work with their translator on populating this form. Figure 4.1 shows the Datastream Information page for the examplevap.c1 output datastream of the example_vap process.
CAUTION: Do not perform this step unless you are working with an ARM science translator on a production ARM VAP.
Because the PCM is intended to expedite the transfer of a scientific algorithm into the ARM production processing system, it is expected that the desired output product is well defined. In addition, for an ADI VAP to run, the DODs of all of its output datastreams need to be loaded into the DSDB. Therefore it is recommended that a new process be initiated by first defining its output product in the PCM DOD GUI.
A new DOD can be created by either starting from a blank template or by importing an existing netCDF file. If importing a DOD from a file, the file must follow ARM data file naming standards, and the site and facility of the datastream must be loaded into the DSDB. Once loaded into the DSDB, the DOD(s) should be updated to conform to ARM DOD standards (add reference once it exists).
DODs associated with a datastream can be accessed via the ‘Datastream’ tab located in the left frame of the PCM. Datastreams with DODs loaded into the DSDB are noted with a triangle that, if selected, expands to list the associated DODs. As shown in Figure 4.1, the examplevap.c1 datastream does not currently have a DOD associated with it.
The process of completely defining a datastream’s DOD includes creating the DOD, adding variables, adding/editing variable attributes, adding global attributes, and resetting variable attribute and global attribute values that will be populated by the ADI libraries or the VAP algorithm. All of these steps will be required for VAPs with DODs created through the use of a template. Whether, and how much, a DOD loaded by importing from an existing netCDF file will need to be edited depends on the contents of the imported netCDF file and whether it meets ARM’s standards.
To create a DOD you can either create a new one from a blank template, or by importing one from a netCDF file. We discuss both techniques in the following steps.
- Select the disk icon located at the top of the DOD (Figure 5.2) to save the DOD to the DSDB.
- Enter the output datastream name (base platform name and data level) of the datastream to which the DOD will belong.
- Enter the version of the DOD, and a comment describing the DOD (Figure 5.4).
Note: More than one DOD version can be defined for any given datastream.
- Select the Save button to save the DOD to the DSDB.
Note: If an inappropriate name is selected the datastream name and level box shown in Figure 5.4 will be highlighted in red and the DOD will not be saved.
Users do not have to define a DOD for an output datastream if they specify a retrieval and set a datastream in the process portion of the PCM; a DOD will be created on the fly. The resulting DOD will not exist in the database, but it will exist in the output data file.
However, a datastream name must be included in the Process Output Datastream Classes in the PCM process definition. That datastream should not have a DOD associated with it.
In addition, in the PCM Retriever Editor, variable names must be provided in the Output(s) column. Thus, the user sets up the mapping to the output as if a DOD for the target datastream existed.
If a DOD exists in the database, then dynamic DODs will have no effect. Current functionality is such that if a DOD exists and the user selects the dynamic DOD option, the existing DOD will be updated with the additional information in the retriever that was not defined in the DOD. The DOD wins, and nothing that already exists is overwritten.
Dynamic DODs are generated using the --dynamic-dods command line option.
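As a sketch (the process name, site, facility, and dates below are placeholders, not values from this document), a dynamic-DOD run of the data_consolidator might look like:

```shell
# Hypothetical invocation: all argument values are placeholders.
# --dynamic-dods builds the output DOD on the fly from the retriever
# mappings when no DOD is assigned to the output datastream in the PCM.
data_consolidator -n example_vap -s sgp -f C1 \
    -b 20110401 -e 20110402 \
    --dynamic-dods
```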
Dimensions:
All dimensions that comprise a variable’s coordinate system must be defined in the DOD. A full definition includes not only meeting the netCDF requirement of documenting dimension name and length, but also, when possible, creating a variable whose name matches that of the dimension, whose dimension is the dimension itself, and whose values document the possible values of that element of another variable’s coordinate system. For example, if a temperature measurement in a file is dimensioned by time and height, where time is the unlimited dimension and height has a length of 10, a variable named ‘height’ should be created that records the 10 heights of the temperature variable’s coordinate system.
Variables:
Output variables include not only variables representing the primary measurements produced by the VAP analysis, but also all input variables whose values are used in determining the primary measurements. These variables are referred to as ‘passthrough variables’.
Variable attributes that must be defined for all variables include the variable name, dimensions, data type, long_name, and units. Attributes that should exist if they are applicable to the variable include valid_min, valid_max, valid_delta, resolution, and missing_value. Optional attributes that can be added to provide additional detail include: comment(s), precision, accuracy, and uncertainty. In addition to these attributes, the user can also indicate whether the variable has a companion QC variable and whether it is a primary measurement.
Global Attributes:
Global attributes are metadata that apply to all variables in the file and over all time samples. Because a global attribute is itself metadata, it should not have attributes of its own. As a result, if a global attribute has a unit, it should probably be defined as a static variable rather than a global attribute so that its units can be properly documented in a variable attribute.
Global attributes whose values need to be assigned at run time should not be assigned a value in the PCM DOD Definition form. If a global attribute has a value in the PCM, that value cannot be overwritten during processing. If the global attribute already has a value equal to the value being assigned, the process will run successfully. The ADI library sets some ARM standard global attribute values at run time. To ensure these attributes can be properly assigned values during run time, the ADI libraries verify they do in fact have NULL values in the PCM. If they do not, the process will fail with the error “Invalid Global Attribute Value”. The following global attributes are set by the ADI libraries during run time:
command_line, process_version, dod_version, site_id, facility_id, data_level, datastream
Variables, dimensions, variable attributes, and global attributes can be added to a datastream’s DOD by either adding a new DOD element or by dragging and dropping DOD elements from an existing datastream’s DOD. Frequently it is more efficient to add passthrough variables, and the variables correlating to their coordinate dimensions, to a DOD by dragging and dropping the variable from its input datastream. The drag and drop method is helpful for any DOD element that closely resembles an element in an existing datastream. DOD elements can be added:
- By inserting a New DOD Element.
- By dragging and dropping from an existing datastream.
Note: Either method can be used to insert a dimension, variable, or global attribute.
Note: the Retrieval Definition form can be accessed from the VAP Process window tab located at the top of the PCM page. Return to the DOD window by selecting the DOD tab.
All elements and their attributes, regardless of how they were created (imported, as a new element, or copied from an existing datastream), should be reviewed to ensure they meet ARM standards and have been properly defined. The primary measurement check box should be selected for all primary measurement variables so that this characteristic will be stored in the DSDB. Process Definition discusses adding variables and editing variable attributes in the context of the drag and drop example, but the updates for a variable created from scratch are done in a similar manner.
Note the following example of editing DOD variable elements:
For the example_vap process being developed, the passthrough variables were populated in the examplevap.c1 DOD v0.0 through a drag and drop process. Figure 5.7 shows these variables expanded with their original values. Figure 5.8 shows the same variables after they have been edited for the change in name and units of the cloud base height variable, and to meet VAP QC standards. To expedite the process, the attributes of the qc_first_cbh were copied from another *c1 level datastream. Note that in the following figure, the variables added to the basic template are noted in green, and in Figure 5.8, existing attributes that have been updated are indicated in red and new attributes are highlighted with a green bar.
Passthrough variables before editing.
Passthrough variables after editing.
For the example VAP we will add one new variable ‘new_cloud_base_height’ that is equal to the ‘cloud_base_height’ variable divided by 10 to convert its units to meters. The ‘new_cloud_base_height’ and ‘qc_new_cloud_base_height’ variables were created by duplicating the ‘cloud_base_height’ variable (select the cloud_base_height variable and then the paper duplicate icon), and then edited to update their names and attribute values. The new variables and their attributes are shown in the following figure.
The figure above shows the completed example_vap process DOD.
Complete the definition of the datastream by addressing the error, warning, and notice indicators in the DOD that appear to the left of the variables as red, yellow, and grey circles respectively. The colors indicate the confidence that a fix is needed and the severity of leaving it unfixed. Red indicates it is highly likely a fix is needed and that not fixing it could significantly impact a user’s or automated processing application’s ability to understand and utilize the information. Yellow indicates a fix is likely needed, but with known exceptions, so it is never auto-fixed. Grey indicates a low impact, possibly subjective, recommended change and not an indication of a problem. Mousing over an indicator will reveal a pop-up describing the issue. If the pop-up is ‘+ 1 child notice’, expand the variable; the indicator will move to the appropriate attribute of the variable and the pop-up will be updated with the specific issue.
In Figure 5.10 the error ‘Missing or invalid coordinate dimension for this dimension’ occurs because the ‘backscatter’ variable is dimensioned by time and range. While the range dimension was automatically added to the DOD, ARM standards require that a variable ‘range’ also be included in the DOD if it can be defined. This issue can be resolved by returning to the input datastream, and dragging and dropping the ‘range’ variable into the DOD.
It is necessary to clear all variable attributes and global attributes whose values are populated by the VAP process itself; otherwise changes in these values will be associated with changes in DOD version. More importantly, the ADI shared libraries will overwrite the values populated by the VAP process with the values indicated in the output DOD. For a DOD populated by importing from an existing netCDF file, if a user does not delete the value of the ‘history’ global attribute, the history value from the imported file will be propagated to all output files. The error, “Attribute should not be NULL (set at run-time)”, indicated by a red circle, flags such problems. Attributes that are known to be set at run time are automatically deleted by the auto-fix function executed via the check box icon. The DOD should also be reviewed for non-standard attributes whose values are defined by the VAP process, and their values manually set to NULL.
CAUTION: Periodically save the DOD. Exiting the Processing Configuration Manager web page without saving will result in the loss of data.
Processes defined in the PCM can be run either using the data_consolidator application, or by developing an application specific to a particular process. If your intent is to simply consolidate data from existing ARM netCDF datastreams with the application of unit and data type conversions and coordinate dimension transformations, the Data Consolidator application can run the process immediately after the required information has been documented in the PCM, with no need for the user to write or compile any code. If the data product you want to create requires modifying the input data in a manner not supported by the PCM, or includes variables that cannot be derived from the input via the PCM, a software project specific to that process is needed.
create_adi_project is a source code generation tool that uses the PCM database entries to create a C, IDL, or Python software project for processes defined in the PCM. The generated project can compile and run with no additional code, producing netCDF files with all variables that can be derived from the database entries made via the PCM. The source code produced has hooks into which users can insert their own code, thus jump starting the development of their ARM Value Added Products (VAPs).
ADI shared libraries use environment variables to determine the location of data, configuration files, and binaries. Users developing their own application or running the data_consolidator application will need to set both the data related environment variables described below and the process related ones that follow. Any required subdirectories that ADI expects at the environment variable location are listed.
- DATA_HOME: Base directory for data. On the ADI development system it is suggested that this directory be defined as /data/home/<user>/data. Subdirectories: conf, datastream, logs, quicklook. Example: /data/home/gaustad/data
- DATASTREAM_DATA: Location for ARM netCDF data. This should be defined as $DATA_HOME/datastream. Subdirectories: <site>/<datastream>. Example with subdirectories: /data/home/gaustad/data/datastream/sgp/sgpsirsC1.b1
- LOGS_DATA: Location of logs generated during a run. This should be defined as $DATA_HOME/logs. Subdirectories: frequently organized in <site> subdirectories.
- INPUT_DATASTREAM_DATA: Location of input data if it is not in $DATASTREAM_DATA. Same as DATASTREAM_DATA, but only used to find the input datastream directories.
- OUTPUT_DATASTREAM_DATA: Same as DATASTREAM_DATA, but only used to find the output datastream directories.
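A minimal bash setup following the layout above might look like the following sketch. The base directory here uses $HOME/data rather than the suggested /data/home/<user>/data so the example is self-contained, and LOGS_DATA is assumed to be the name of the logs variable.

```shell
# Example data environment setup (bash). DATA_HOME is rooted under $HOME
# here for illustration; on the ADI development system the suggested
# location is /data/home/<user>/data.
export DATA_HOME="$HOME/data"
export DATASTREAM_DATA="$DATA_HOME/datastream"   # ARM netCDF data
export LOGS_DATA="$DATA_HOME/logs"               # run-time logs (assumed name)

# Create the subdirectories ADI expects at these locations.
mkdir -p "$DATA_HOME/conf" "$DATA_HOME/quicklook" "$LOGS_DATA"

# Input data is organized as $DATASTREAM_DATA/<site>/<datastream>, e.g.:
mkdir -p "$DATASTREAM_DATA/sgp/sgpsirsC1.b1"
```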
Additional environment variables whose default values should not be changed, but may be useful for developers writing additional source code to perform analysis are documented below. Two locations are provided for storing configuration files used by the process. Where a configuration file should be stored is a function of whether the configuration file is routinely updated, or whether it is mostly unchanged (i.e., may be updated a handful of times a year).
Location of configuration files that do not change over time, or change at most once a year. As such these files can and should be maintained in the VAP’s GitLab repository.
Location for configuration files that change more than once a year. Within this directory, files can be organized by site in $CONF_DATA/<site>/<site><process_name><facility> or by VAP in $CONF_DATA/<vap>/
Because the files in $CONF_DATA are not released, an alternative method of installing them on the production processing system is needed. There are two possible methods of updating files in $CONF_DATA: (1) create a stand-alone task in ServiceNow to have the system administrators copy them into the desired location, or (2) use doorstep to install the configuration files. Details for both methods are described below.
This method is recommended when the file will be updated infrequently (a few times a year), or when files only need to be transferred to production once when the VAP (or a new site for the VAP) is set up because subsequent updates will be done automatically by the VAP process.
Requests to transfer files to production should be made via ServiceNow, preferably in an ENG or EWO associated with the VAP or, if those are not available, in a stand-alone incident. Describe where the files should be installed in $CONF_DATA and the location where the files to transfer to production can be found, and assign the ticket to the ADC system administrators.
!!This method can currently ONLY be used to install files to $CONF_DATA/<site>/<site><process_name><facility>!! As such it only supports installation of conf files that require a separate file for each site and facility. To use this method:
The ‘data_consolidator’ is an application that performs the transformations and mappings from retrieved variables to output variables for any process defined in the PCM. As such it allows users to consolidate data from diverse datastreams without the need to create or compile any source code. It takes as input the name of the retriever process whose retrievals, transformations, and input to output mappings are to be applied and the typical ARM process arguments.
The data_consolidator command line arguments include the typical arguments for any ADI process with the addition of “-n <process>” to specify the process. The frequently used arguments include:
-n <process>
-s <site>
-f <facility>
-b <begin date>
-e <end date>
-a <database> (possible values are "dsdb_ref" and "devws")
-D <debug level> (level 2 will dump retrieved, transformed, and output structs)
-P (to log provenance)
-R (reprocessing flag to allow the overwrite of previously created netCDF files)
Additional arguments include:
--log-dir <path> (path to the log file directory)
--log-file <file> (name of the log file)
--log-id <id> (replaces the timestamp in log file name with the specified id)
--max-runtime <seconds> (sets the max runtime for the process, 0 disables max runtime check)
--files <file1,file2,…> (for ingests only, specifies comma delimited list of files to process)
--asynchronous (disables the process lock file, disables check for chronological data processing, disables overlap checks with previously processed data, forces a new file to be created for every output dataset).
--dynamic-dods (creates a DOD on the fly when the process does not have one assigned to it in the PCM). This requires the following:
- A datastream name that does not have a DOD associated with it must be entered into the PCM Process Inputs and Outputs form. The output file will use this datastream name.
- The PCM Retriever Editor must have entries in the Output(s) column that map the retrieved variables to the output datastream. The names provided in the mapping will be the names used in the output file produced.
With respect to (-e), end date, please note that the process will run for the date specified as the begin date up to the end date (i.e., NOT through the end date).
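Putting the arguments together, a typical run might look like the following sketch (the process name, site, facility, and dates are placeholders, not values from this document):

```shell
# Hypothetical invocation: process "example_vap" at SGP C1, processing
# 2011-04-01 only (-e is exclusive, so the run stops before 20110402),
# with debug dumps (-D 2) and the reprocessing flag (-R) enabled.
data_consolidator -n example_vap -s sgp -f C1 \
    -b 20110401 -e 20110402 \
    -D 2 -R
```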
If the debug level is set to two (-D 2), the data_consolidator app will dump the contents of the retrieval, transform, and output structures to a subdirectory ‘debug_dumps’. The dump files created and the structures they contain are listed below.
<site><process_name><facility>.YYYYMMDD.HHMMSS.post_retrieval.debug
<site><process_name><facility>.YYYYMMDD.HHMMSS.pre_transform.debug
<site><process_name><facility>.YYYYMMDD.HHMMSS.post_transform.debug
<site><output_datastream_name><facility>.<output_datastream_level>.YYYYMMDD.HHMMSS.process_data.debug
After the VAP process has been fully defined in the PCM and saved to the DSDB, the create_adi_project application can be run to create a C, IDL, or Python project comprised of a main module, hooks for the [ADI Data Processing Modules](https://engineering.arm.gov/ADI_doc/framework.html#data-processing-modules), supporting files documenting retrieved, transformed, and output variables, and the Makefiles needed to compile the VAP binary. There are three templates that create a full project: two supporting VAP development and an ingest template. The VAP templates include a ‘transform’ template that creates a project that includes a call to the ADI transformation module, and a ‘retriever’ template that does not include the call.
The required input parameters for create_adi_project include the specification of the process for which templates are being produced, the template type, and the directory into which the templates will be created. Optional input parameters are provided to document the source code with the developer’s contact information, to produce a dump of the DSDB elements associated with the process into a JSON data file, and to run from such a JSON dump rather than accessing process information from the DSDB. A complete summary of the create_adi_project command line options is shown in the following table along with an example.
| create_adi_project Usage | | | | |
|---|---|---|---|---|
| Input Argument | Long Option | Argument Value | Req | Argument Description |
| -h | --help | N/A | No | Print usage information |
| -p | --process | <process> | Yes | Name of process defined in PCM |
| -t | --template | <template> | Yes | Type of template to create |
| -o | --output | <output directory> | Yes | Directory location to place templates |
| -d | | N/A | No | Dump json data from webservice |
| -i | --input | <input file> | No | Json file to use instead of getting process params from DSDB |
| -a | --author | <name> | No | Developer’s name |
| -n | --phone | <phone> | No | Developer’s phone |
| -e | | <email> | No | Developer’s email address |
| -v | --dodversion | <version> | No | Create output field file using specific DOD version. Default is to create output file as union of all DODs. |
Which template type should be provided in the -t option is dependent on whether the intent is to create the initial set of source code or to propagate changes made in the PCM that will impact the source code. Primary templates create all the necessary files for a process’s project. The available primary templates by supported languages are shown below.
| create_adi_project Primary Templates | | |
|---|---|---|
| C | IDL | Python |
| transform | idl_transform | py_transform |
| retriever | idl_retriever | py_retriever |
| ingest | idl_ingest | py_ingest |
WARNING: Updating any template in which you have inserted logic will result in the loss of that logic. To prevent overwriting the main module <process>_vap.c and supporting <process>_vap.h files into which developers add their code, the template generator should not be rerun using the ‘transform’, ‘retriever’, or ‘ingest’ templates after development has begun.
Individual elements of the project are also templates and should be used to implement updates, if necessary, after development has begun. Available supporting templates are listed in the table below, with the primary template(s) to which each applies noted in the first column.
| create_adi_project Supporting Templates by Primary Template | | | | |
|---|---|---|---|---|
| Primary Template | C | IDL | Python | |
| transform | makefiles | idl_makefiles | makefiles py_makefile_lib | |
| retriever | makefiles | idl_retriever_makefiles | makefiles py_makefile_lib | |
| ingest | makefiles_ingest | idl_makefiles_ingest | makefiles_ingest py_makefile_lib_ingest | |
| transform | vars | idl_vars | py_vars | |
| retriever | vars_retriever | idl_vars_retriever | py_vars_retriever | |
| transform retriever | test | idl_test | py_test | |
| ingest | test | idl_test_ingest | py_test_ingest | |
| transform retriever | input_fields | idl_input_fields | py_input_fields | |
| transform retriever | output_fields | idl_output_fields | py_output_fields | |
| transform retriever | trans_fields | idl_trans_fields | py_trans_fields | |
Not all updates to the PCM result in the need to regenerate header files. After any change to the data stored in the DSDB, the user must determine whether they need to rerun create_adi_project with a subcomponent template option, and if so, which one. A good rule of thumb is that if the change affects a DSDB entity (for example, a new variable is being retrieved, a variable is renamed, or a coordinate system is defined) then an update is needed. If, however, the change only affects a DSDB entity’s attribute (such as the units of a variable, or the sampling interval of a transform) then none of the VAP’s header files need to be regenerated.
If a change was made to the PCM that required an update to the source code, and the template was not rerun to create the impacted file(s), the process will fail to run and will produce an error message giving an indication of the header file with the inconsistency. Table 6.2 lists the template types for C projects, the files they create, and a description of each file’s purpose.
| create_adi_project C Templates | | |
|---|---|---|
| Template | Files Created | Description |
| transform | <process>_vap.c, <process>_vap.h, Makefile, Makefile.aux, <process>_input_fields.h, <process>_trans_fields.h, <process>_output_fields.h | Creates all files that comprise a C project that will perform a transformation. Typically run only when creating the initial set of templates. <process>_vap.c is the main source code into which the user should add VAP specific logic (see the following figure). <process>_vap.h defines prototypes, structures, and macros needed by the user. |
| retriever | <process>_vap.c, <process>_vap.h, Makefile, Makefile.aux, <process>_input_fields.h, <process>_output_fields.h | Creates all files that comprise a C project that will not perform a transformation. Typically run only when creating the initial set of templates. <process>_vap.c is the main source code into which the user should add specific logic. <process>_vap.h defines prototypes, structures, and macros needed by the user. |
| vars | <process>_input_fields.h, <process>_trans_fields.h, <process>_output_fields.h | Creates all header files. Users should not edit these files. This template can be run if the user is unsure whether, and if so which, header file is affected by a change to the PCM entries. |
| input_fields | <process>_input_fields.h | Contains the structure of retrieved variables and indexes to access the names within the structure. Not used by ADI libraries, but provided to encourage standardized access to input fields. Users should not edit this file. |
| trans_fields | <process>_trans_fields.h | Contains structures of retrieved variables in the context of the coordinate systems to which they have been assigned in the PCM. In addition to the variable name, the indexes used to access the values within the structures are based on the coordinate system and datastream group from which the variable was retrieved. Not used by ADI libraries, but provided to encourage standardized access to transformed fields. Users should not edit this file. |
| output_fields | <process>_output_fields.h | Contains structures of output variables in the context of the output datastreams. In addition to the variable name, the indexes used to access the values within the structures are based on the output datastream name and level. Not used by ADI libraries, but provided to encourage standardized access to output fields. Users should not edit this file. |
| makefiles | Makefile, Makefile.aux | For a system with a SWAWT environment (http://engineering.arm.gov/base/swawt/). Makefile.aux can be updated to link with outside libraries, adjust compile options, etc. Makefiles are typically created with a primary template (transform, retriever, ingest) and should not be recreated after development has begun, as that will overwrite user updates. |
The first time create_adi_project is run for a new VAP, it should be run using either the ‘transform’ or ‘retriever’ template and include specifications for the author information to document the VAP’s main C module and header file.
$> create_adi_project -p <process name> -t <primary template type> -o <project directory> -a <’developer name’> -n <developer phone number> -e <your email address>
Note: Quotes are necessary around inputs for which there are white spaces, such as developer name and possibly phone number.
Compiling the code produced by the templater after running with a primary template as input will produce a binary that runs “out of the box”, creating output netCDF file(s) with all passthrough variable values and completed DOD headers. The values of output variables that are to be calculated by the VAP (i.e., by the source code the user will add to the <process>_vap.c file) will be populated with fill values to indicate that a value has not yet been assigned. Details of the VAP command line options are discussed in create_adi_project Command Line Arguments.
To compile and run a process created with the create_adi_project:
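The steps are roughly the following sketch; the project directory and process name are placeholders, and the Makefile is the one generated by create_adi_project:

```shell
# Build and run a freshly generated C project (names are placeholders).
cd ~/projects/example_vap        # the -o directory given to create_adi_project
make                             # compile using the generated Makefile
# Run the resulting binary with the usual ADI arguments:
./example_vap -s sgp -f C1 -b 20110401 -e 20110402 -D 2
```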
Users should examine the output produced from the template prior to inserting their own code to validate that the PCM variable definitions, data conversions, and transformations were executed as expected.
Once a developer has begun adding their own code to the VAP, future runs of the template generator should typically be limited to one of the header templates. A simple approach is to use the vars template to recreate all template generated header files.
$> create_adi_project -p <process name> -t vars -o <project directory>
A summary of the changes to PCM entries that can affect the content of one or more ADI templates is presented in the following table, PCM Changes that Impact create_adi_project Templates. Many changes to the PCM process, retrieval, and output DOD do not require a change to the project’s header files. For example, a reordering of variables in the output file does not affect the source code.
| PCM Changes that Impact create_adi_project Templates | | | |
|---|---|---|---|
| PCM Area | PCM Element | Change | Templates Affected |
| Process | |||
| Process Name | Rename | primary template [1] (i.e., transform or retriever) | |
| Process Type | Type assignment | primary template [1] | |
| Retrieval Editor | |||
| Source(s): group name | Add, remove, rename | trans_fields | |
| Variable Name | Add, remove, rename | input_fields trans_fields | |
| Coord System: name | Add, remove, rename | trans_fields | |
| Output Datastream/DOD | |||
| Datastream name | Add, remove, rename | output_fields | |
| Datastream level | Change | output_fields | |
| Variable name | Add, remove, rename | output_fields | |
| [1] | A change in name affects all templates, but since the transform and retriever primary templates regenerate all files, they are the only ones that need to be rerun. |
The only template files into which the user should add code are the Makefile.aux and the VAP source and header files, <process>_vap.c and <process>_vap.h. However, a user is free to create additional *.c and *.h files as they see fit to organize their own code. It is recommended that code additions to the <process>_vap.c file be limited to calls to functions in other *.c files created by the user.
Before adding their own source code to a project created by create_adi_project, developers should understand the process through which data is retrieved, consolidated, and stored in ADI applications (i.e., the data_consolidator tool and create_adi_project projects compiled ‘as is’).
Q: If I do not intend to alter any of the coordinate variables of a retrieved variable in any way, do I have to assign a name to the coordinate system in the PCM Coord System column?
A: No. If no name is provided, the coordinate system will be assigned a name equal to auto_<datastream_name>_<datastream_level>, for example auto_mfr10m_b1. If any change is made to the coordinate system, a name of coords_<x> is assigned, where x increases incrementally. It is recommended that users apply a more meaningful name to their coordinate system definitions.
Q: What time does base time reflect?
A: The default value of base_time will always be the time of midnight prior to the first sample time. You can change this to be the time of the first sample in the file by calling dsproc_set_base_time just prior to setting the times in the output dataset.
Q: I am getting an error when running create_adi_project for my process, what is causing it?
A: If you have a datastream assigned as an output, but it does not have a DOD defined for it, the project cannot be created. Either add a DOD to the datastream or remove the output datastream from the PCM’s process definition form.
Q: Why does my process end with Suggested exit value:0 (successful), but the output is not what I expected, or the process did not run to completion?
A: The exit value of 0, representing success, indicates that the process completed with no unexpected errors. It is up to the user to insert the necessary error handling logic into their source code. If a process exits with success, but the output is not valid, or the process did not complete, then additional error handling is needed. Find the point the process deviated from the desired output, and set a process error using the DSPROC_ERROR macro or dsproc_error function.
Q: My process is not retrieving a companion QC variable because the variable is not an integer data type in the input file. How do I get my process to run?
A: If a companion QC variable is not an integer data type it cannot be retrieved. In such cases, do not select the QC checkbox in the Retriever Editor form. Instead, elect to set new min and max values on the variable through the Retriever Editor form. This will result in a new QC variable being created with the limits specified in the PCM applied. If the original QC variable had more than min, max, or delta tests applied, you will also need to explicitly retrieve the QC variable (versus retrieving it by selecting the QC check box for an explicitly retrieved variable). You MUST rename the retrieved QC variable by entering a different name in the ‘Variable Name’ column to prevent it from conflicting with the auto-generated QC variable. Lastly, in a user hook, update the auto-generated QC variable as necessary to properly document the quality of the variable.
Q: How do I create quicklooks inside the process interval loop?
A: Output netCDF files are not created until the VAP has processed all dates falling between the begin and end dates specified in the command line. To create quicklooks for each process interval you must update the VAP to produce an output netCDF file as the processing for each individual interval completes. To do this, call dsproc_store_dataset at the end of the process_data hook and create the quicklooks when it completes.
Q: The contents of my output file are not what I expect.
A: If you made changes to your source code, did you recompile? Is the directory of the output you’re reviewing the same directory used to create the data? Did you update the PCM entries? If so, did you recreate the input, transform, and output configuration files and recompile before rerunning your process? If you did, and the PCM changes you made still didn’t take effect, are you sure you selected the appropriate Save button (if you updated a DOD, it needs to be saved separately from changes made to the process definitions)?
Q: My VAP runs with an exit value of 0: (Successful), but an output data file is not created.
A: Check that you defined mappings from your retrieved variables to output variables in the PCM Retriever Editor form. If the mappings are missing, the process will end successfully because it did not encounter an error, but there will be an indication of the problem in the debug messages of the process_data hook. It will be unable to produce a dump of the process_data structure and will note that no data could be found in the output dataset, as shown in the following example:
----- ENTERING PROCESS DATA HOOK -------
dsproc_print.c:168 Creating dataset dump file:
- dataset: /sgpvapexample3E13.c1
- file: ./debug_dumps/sgpvapexample3E13.c1.19700101.000000.process_data.debug
dsproc_hooks.c:179 ----- EXITING PROCESS DATA HOOK --------
dsproc_dataset_store.c:392 sgpvapexample3E13.c1: No data found in output dataset
===================================================================
dsproc.c:1577 EXITING PROCESS
===================================================================
Q: How can files in $CONF_DATA on the ARM production processing system be updated?
A: Because files in $CONF_DATA are expected to change, it is not appropriate to release them as part of a process GitLab repository. There are a number of ways to update them; see the two methods described earlier (a ServiceNow request to the ADC system administrators, or installation via doorstep).