				| <html>
 | |
| <body>
 | |
| <center>
 | |
| <pre>
 | |
| /* Copyright 2009, UCAR/Unidata and OPeNDAP, Inc.
 | |
|    See the COPYRIGHT file for more information. */
 | |
| </pre>
 | |
| <h1>NCGEN Internals Documentation</h1>
 | |
| <h3>Draft: 03/07/2009<br>
 | |
| Last Revised: 07/15/2009</h3>
 | |
| </center>
 | |
| 
 | |
| <h1><u>Introduction</u></h1>
 | |
| This document is an ongoing effort to
 | |
| describe the internal operation of the ncgen
 | |
| cdl compiler; ncgen is a part of the netcdf
 | |
| system.
 | |
| <p>
 | |
| The document has two primary parts.
 | |
| <ol>
 | |
| <li><a href="#LANG">Language Support</a>
 | |
| -- describes how to add a new output language to ncgen.
 | |
| <p>
 | |
| <li><a href="#GIT">General Internals Information</a>
 | |
| -- describes additional information about the internals;
 | |
| parsing, for example.
 | |
| </ol>
 | |
| 
 | |
| <h1><u><a name="LANG">Modifying NCGEN to Output a New Language</a></u></h1>
 | |
| 
 | |
| This document outlines the general method for adding
 | |
| a new language output to ncgen. Currently, it supports
 | |
| binary, C, and (experimentally) NcML and Java.
 | |
| Before reading this document, the reader should also
 | |
| review the internals.html document.
 | |
| <p>
 | |
| Also, the reader should note that the code is a bit crufty
 | |
| and needs refactoring.  This is primarily because
 | |
| it was originally defined to support only C and
 | |
| each new language stresses the code.
 | |
| <p>
 | |
| In order to get ncgen to generate output for a new
 | |
| language, the following steps are required.
 | |
| 
 | |
| <ol>
 | |
| <li> <a href="#Misc">Modify various files to invoke the new language output.</a>
 | |
| <li> <a href="#Create">Create a new set of generate functions.</a>
 | |
| </ol>
 | |
| 
 | |
| <h2><a name="Misc">Modify various files to invoke the new language output.</a></h2>
 | |
| The following steps are required to provide the necessary code
 | |
| to invoke a new language output.
 | |
| For the purposes of this discussion, let us call the language Java.
 | |
| 
 | |
| <h4>ncgen.h</h4>
 | |
| <ol>
 | |
| <li> Locate the code enabler #defines
 | |
| (e.g. <code>#define ENABLE_C</code>)
 | |
| and insert a new one of the form
 | |
| <pre>
 | |
| #define ENABLE_JAVA
 | |
| </pre>
 | |
| </ol>
 | |
| 
 | |
| <h4>main.c</h4>
 | |
| <ol>
 | |
| <li> Locate the global declaration (<code>int fortran_flag;</code>)
 | |
| and insert a new declaration.
 | |
| <pre>int java_flag;</pre>
 | |
| 
 | |
| <li> Locate the initialization (<code>fortran_flag = 0;</code>)
 | |
| in the body of the main() procedure and add a new initialization.
 | |
| <pre>java_flag = 0;</pre>
 | |
| 
 | |
| <li>Locate the options processing switch case for -l (<code>case 'l':</code>).
 | |
| Duplicate one of the instances there and add to the conditionals.
 | |
| It should look like this.
 | |
| <pre>
 | |
|     } else if(strcmp(lang_name, "java") == 0
 | |
|        || strcmp(lang_name, "Java") == 0) {java_flag = 1;}
 | |
| </pre>
 | |
| 
 | |
| <li> Just after the options processing switch code,
 | |
| there are a number of #ifndef conditionals
 | |
| (e.g. <code>#ifndef ENABLE_C</code>).
 | |
| Add a new one for Java.
 | |
| It should look like this.
 | |
| <pre>
 | |
| #ifndef ENABLE_JAVA
 | |
|     if(java_flag) {
 | |
| 	  fprintf(stderr,"Java not currently supported\n");
 | |
| 	  exit(1);
 | |
|     }
 | |
| #endif
 | |
| </pre>
 | |
| </ol>
 | |
| 
 | |
| <h2><a name="Create">Create a new set of generate functions.</a></h2>
 | |
| The hard part is creating the actual code generation files.
 | |
| To do this, it is easiest to take one of the existing
 | |
| generators and modify it, viz:
 | |
| <ul>
 | |
| <li> copy genc.c genj.c
 | |
| <li> copy cdata.c jdata.c
 | |
| </ul>
 | |
| The genj.c file will do most of the code generation. The jdata.c file
 | |
| will generate lists of data constants that come from the CDL data: section.
 | |
| There is nothing magical about using two files: they can be refactored
 | |
| as desired.
 | |
| <p>
 | |
| In order to facilitate code generation, it is useful to look
 | |
| at the translations produced by other languages.
 | |
| The idea is to take these translations and decide what the
 | |
| corresponding Java (for example) code would look like.
 | |
| Then the idea is to modify the genc code (in genj.c)
 | |
| to reflect that translation.
 | |
| <p>
 | |
| In most of the rest of this discussion, the genc.c and cdata.c
 | |
| code will be used to explain the operation.
 | |
| Appropriate procedure renaming should be done for new languages
 | |
| (e.g., for Java, <i>genc_XXX</i> is changed to <i>genj_XXX</i>
 | |
| consistently).
 | |
| 
 | |
| <h3>Useful Output Procedures</h3>
 | |
| The following output procedures are defined in genc.c to create C output.
 | |
| The idea is that output is accumulated in a <a href="#Bytebuffer">Bytebuffer</a>
 | |
| called ccode.  Periodically, ccode
 | |
| contents are flushed to stdout.
 | |
| The relevant procedures from the C code are as follows.
 | |
| <ol>
 | |
| <li> <code>void cprint(Bytebuffer* buf)</code>
 | |
| -- dump the contents of buf to output (ccode actually).
 | |
| <li> <code>void cpartial(char* line)</code>
 | |
| -- dump the specified string to output.
 | |
| <li> <code>void cline(char* line)</code>
 | |
| -- dump the specified string to output and add a newline.
 | |
| <li> <code>void clined(int n, char* line)</code>
 | |
| -- dump the specified string to output preceded by
 | |
| <i>n</i> instances of indentation.
 | |
| <li> <code>void cflush(void)</code>
 | |
| -- dump the contents of ccode to standard output
 | |
| and reset the ccode buffer.
 | |
| </ol>
 | |
| There is, of course, nothing sacred about these procedures:
 | |
| feel free to modify as needed. In fact, there are two
 | |
| important reasons to modify the code.
 | |
| First, the indentation rules may differ from language to language
 | |
| (FORTRAN 77 for example). Second, the rules for folding lines
 | |
| that are too long differ across languages.
 | |
| It is usually easiest to handle both of these issues
 | |
| in the output procedures.
 | |
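<p>
As an illustration, here is a minimal sketch of how the output procedures might look for a new language. Everything in it is an assumption made for the example: the Outbuf type is a stand-in for Bytebuffer, the j-prefixed names mirror cpartial, cline, clined and cflush, the buffer is passed explicitly rather than kept in a global like ccode, and the indentation width JINDENT is arbitrary. It is not the code in genc.c or genj.c.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for Bytebuffer: a growable, NUL-terminated string. */
typedef struct Outbuf {
    char*  text;
    size_t length;   /* bytes in use, excluding the trailing NUL */
    size_t alloc;    /* bytes allocated */
} Outbuf;

#define JINDENT 4    /* assumed number of blanks per indentation level */

/* Append a string, growing the buffer when needed. */
static void
outappend(Outbuf* buf, const char* s)
{
    size_t n = strlen(s);
    assert(buf != NULL);
    if(buf->length + n + 1 > buf->alloc) {
        size_t newalloc = (buf->alloc + n + 1) * 2;
        char* newtext = (char*)realloc(buf->text, newalloc);
        if(newtext == NULL) abort();
        buf->text = newtext;
        buf->alloc = newalloc;
    }
    memcpy(buf->text + buf->length, s, n + 1);
    buf->length += n;
}

/* Analog of cpartial: append the string without a newline. */
void
jpartial(Outbuf* out, const char* line)
{
    outappend(out, line);
}

/* Analog of cline: append the string followed by a newline. */
void
jline(Outbuf* out, const char* line)
{
    outappend(out, line);
    outappend(out, "\n");
}

/* Analog of clined: append n levels of indentation, then the line. */
void
jlined(Outbuf* out, int n, const char* line)
{
    int i;
    for(i=0;i<n*JINDENT;i++) outappend(out, " ");
    jline(out, line);
}

/* Analog of cflush: write the accumulated text and reset the buffer. */
void
jflush(Outbuf* out, FILE* f)
{
    if(out->length > 0) fwrite(out->text, 1, out->length, f);
    out->length = 0;
    if(out->text != NULL) out->text[0] = '\0';
}
```

Language-specific indentation or line folding would be added inside jlined and jline, so the rest of the generator can stay unaware of it.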
| <p>
 | |
| The <a href="#Bytebuffer">Bytebuffer</a> type is an important data structure.
 | |
| It allows for dynamically creating strings of characters
 | |
| (actually arbitrary 8 bit values).
 | |
| Most of the operations should be obvious: examine bytebuffer.h.
 | |
| It is used widely in this code especially to capture sub-pieces
 | |
| of the generated code that must be saved for out-of-order output.
 | |
| 
 | |
| <h3>Code Generation</h3>
 | |
| The code generation method used for C is a pretty good
 | |
| general paradigm, so this discussion will use it as a model.
 | |
| The gen_ncc procedure is responsible for
 | |
| creating and dumping the generated C code.
 | |
| <p>
 | |
| It has at its disposal several global lists of Symbols.
 | |
| Note that the lists cross all groups.
 | |
| <ul>
 | |
| <li>dimdefs - the set of symbols defining dimensions.
 | |
| <li>vardefs - the set of symbols defining variables.
 | |
| <li>attdefs - the set of symbols defining non-global attributes.
 | |
| <li>gattdefs - the set of symbols defining global attributes.
 | |
| <li>grpdefs - the set of symbols defining groups.
 | |
| <li>typdefs - the set of symbols defining types; note that this list
 | |
| has been topologically sorted so that a given type depends only
 | |
| on types with lower indices in the list.
 | |
| </ul>
 | |
| <p>
 | |
| The superficial operation of gen_ncc is as follows; the details
 | |
| are provided later where the operation is complex.
 | |
| <ol>
 | |
| <li>Generate header code (e.g. "#include &lt;stdio.h&gt;").
 | |
| <li>Generate C type definitions corresponding to the
 | |
| CDL types.
 | |
| <li>Generate VLEN constants.
 | |
| <li>Generate chunking constants.
 | |
| <li>Generate initial part of the main() procedure.
 | |
| <li>Generate C variable definitions to hold the ncids
 | |
| for all created groups.
 | |
| <li>Generate C variable definitions to hold the typeids
 | |
| of all created types.
 | |
| <li>Generate C variables and constants that correspond
| to the CDL dimensions.
 | |
| <li>Generate C variable definitions to hold the dimids
 | |
| of all created dimensions.
 | |
| <li>Generate C variable definitions to hold the varids
 | |
| of all created variables.
 | |
| <li>Generate C code to create the netCDF binary file.
 | |
| <li>Generate C code to create all groups in the proper
 | |
| hierarchy.
 | |
| <li>Generate C code to create the type definitions.
 | |
| <li>Generate C code to create the dimension definitions.
 | |
| <li>Generate C code to create the variable definitions.
 | |
| <li>Generate C code to create the global attributes.
 | |
| <li>Generate C code to create the non-global attributes.
 | |
| <li>Generate C code to leave define mode.
 | |
| <li>Generate C code to assign variable datalists.
 | |
| </ol>
 | |
| <p>
 | |
| The following code generates C code for defining the groups.
 | |
| It is fairly canonical and can be seen repeated in variant form
 | |
| when defining dimensions, types, variables, and attributes.
 | |
| <p>
 | |
| The following assignment is redundant, but for consistency the root group
| ncid is stored like all other group ncids.
 | |
| Note that nprintf is a macro wrapper around snprintf.
 | |
| <pre>
 | |
| nprintf(stmt,sizeof(stmt),"%s%s = ncid;",indented(1),groupncid(rootgroup));
 | |
| cline(stmt);
 | |
| </pre>
 | |
| <p>
 | |
| The loop walks all group symbols in preorder
| and generates a C call to nc_def_grp
| using parameters taken from the group Symbol instance (gsym).
| The call to nc_def_grp is followed by a call to the
| check_err procedure to verify the operation's result code.
 | |
| <pre>
 | |
| for(igrp=0;igrp<listlength(grpdefs);igrp++) {
 | |
|     Symbol* gsym = (Symbol*)listget(grpdefs,igrp);
 | |
|     if(gsym == rootgroup) continue; // ignore root
 | |
|     if(gsym->container == NULL) PANIC("null container");
 | |
|     nprintf(stmt,sizeof(stmt),
 | |
|             "%sstat = nc_def_grp(%s, \"%s\", &%s);",
 | |
| 	    indented(1),
 | |
|             groupncid(gsym->container),
 | |
|             gsym->name, groupncid(gsym));
 | |
|     cline(stmt); // print the def_grp call
 | |
|     clined(1,"check_err(stat,__LINE__,__FILE__);");
 | |
| }
 | |
| flushcode();
 | |
| </pre>
 | |
| Note the call to indented(). It generates a blank string corresponding
| to its argument, the indentation level N; depending on the indentation
| width, level N might produce more or fewer than N blank characters.
 | |
| <p>
 | |
| Note also that one must be careful when dumping names
 | |
| (e.g. gsym->name above) if the name is expected to contain
 | |
| utf8 characters. For C, utf8 works fine in strings, but with
 | |
| a language like Java, which takes utf-16 characters,
 | |
| some special encoding is required to convert the non-ascii
 | |
| characters to use the \uxxxx form.
 | |
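<p>
The following sketch shows one way such an encoding step could be written. The function jescapename is hypothetical (it is not taken from genj.c), it assumes the input is well-formed UTF-8, and it emits a UTF-16 surrogate pair for characters outside the Basic Multilingual Plane.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy a UTF-8 name into out, rewriting every
   non-ASCII character in the \uxxxx form.  The input is assumed to be
   well-formed UTF-8.  Returns 0 on success, -1 if out is too small or
   the input ends in the middle of a multi-byte sequence. */
int
jescapename(const char* utf8, char* out, size_t outsize)
{
    const unsigned char* p = (const unsigned char*)utf8;
    size_t used = 0;
    assert(utf8 != NULL && out != NULL && outsize > 0);
    while(*p != 0) {
        unsigned long cp;
        int extra, i;
        /* The lead byte determines how many continuation bytes follow. */
        if(*p < 0x80)      {cp = *p;        extra = 0;}
        else if(*p < 0xE0) {cp = *p & 0x1F; extra = 1;}
        else if(*p < 0xF0) {cp = *p & 0x0F; extra = 2;}
        else               {cp = *p & 0x07; extra = 3;}
        p++;
        for(i=0;i<extra;i++) {
            if(*p == 0) return -1; /* truncated sequence */
            cp = (cp << 6) | (unsigned long)(*p++ & 0x3F);
        }
        /* Worst case is 12 output characters plus the trailing NUL. */
        if(used + 13 > outsize) return -1;
        if(cp < 0x80) {
            out[used++] = (char)cp;
        } else if(cp < 0x10000) {
            used += (size_t)sprintf(out + used, "\\u%04lx", cp);
        } else {
            /* Outside the BMP: emit a UTF-16 surrogate pair. */
            cp -= 0x10000;
            used += (size_t)sprintf(out + used, "\\u%04lx\\u%04lx",
                                    0xD800 + (cp >> 10),
                                    0xDC00 + (cp & 0x3FF));
        }
    }
    out[used] = '\0';
    return 0;
}
```

A generator for Java would pass each name through a routine of this kind before placing it inside a string literal.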
| <p>
 | |
| The code to generate dimensions, types, attributes, variables
 | |
| is similar, although often more complex.
 | |
| <p>
 | |
| The code to generate C equivalents of CDL types is
 | |
| in the procedure definectype().
 | |
| Note that this code is not the code that invokes e.g. nc_def_vlen.
 | |
| The generated C types are used when generating datalists
 | |
| so that the standard C constant assignment mechanism will produce
 | |
| the correct memory values.
 | |
| <p>
 | |
| For non-C languages, the interaction between this code and the
 | |
| nc_def_TYPE code may be rather more complex than with C.
 | |
| <p>
 | |
| The genc_deftype procedure is the one that actually
 | |
| generates C code to define the netcdf types.
 | |
| The generated C code is designed to store the resulting
 | |
| typeid into the C variable defined earlier
 | |
| for holding that typeid.
 | |
| <p>
 | |
| Note that for compound types, the NC_COMPOUND_OFFSET
 | |
| macro is normally used to match netcdf offsets to
 | |
| the corresponding struct type generated in definectype.
 | |
| However, there is a flag, TESTALIGNMENT,
 | |
| that can be set to use a computed value for the offset.
 | |
| And for non-C languages, handling offsets is tricky and is
 | |
| addressed in more detail below.
 | |
| 
 | |
| <h3>Data Generation Methods</h3>
 | |
| There are basically three known approaches for generating
 | |
| the data constants that are passed to, for example, <i>nc_put_vara</i>.
 | |
| <ol>
 | |
| <li> For C (and C++) it is possible to generate C language constants
 | |
| directly into the code using the C initializer syntax.
 | |
| This is because CDL was originally defined with C in mind.
 | |
| This method can also be used for FORTRAN when doing classic model only.
 | |
| <p>
 | |
| <li> Generate the binary data
 | |
| and convert it to a large single string constant using
 | |
| appropriate escaping mechanisms; this was done in the original
 | |
| ncgen.
 | |
| This method has the advantage that it can be used for most
 | |
| languages, but it has (at least) two disadvantages:
 | |
| (1) it is not generally portable because the machine architecture
 | |
| influences the memory encoding; (2) it loses all information
 | |
| about the structure of the memory and hence makes debugging
| more difficult.
 | |
| <p>
 | |
| <li>Extend the netCDF interface with a set
 | |
| of operations to build up the memory structure piece by piece.
 | |
| This is the approach taken in the Java generation code.
 | |
| <p>
 | |
| The idea is that one has a set of procedures in C with a simple
 | |
| interface that can be invoked by the output language.
 | |
| These procedures do the following.
 | |
| <ol>
 | |
| <li>Create a dynamically extendible memory buffer (much like Bytebuffer).
 | |
| <li>Append an array of instances
 | |
| of some primitive type to a specified buffer.
 | |
| <li>Invoke nc_put_vara with a specified buffer.
 | |
| <li>Reclaim a buffer
 | |
| </ol>
 | |
| Appropriate calls to these procedures can construct any required memory
 | |
| in a portable fashion.
 | |
| <p>
 | |
| This method is appropriate to use with most non-C languages, with interpretive
 | |
| languages (e.g., Ruby and Perl), and even is probably the best way to
 | |
| get FORTRAN to handle the full netcdf-4 data model.
 | |
| </ol>
 | |
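<p>
The four buffer operations listed under the third approach can be sketched in C as follows. All names here (Membuf, membuf_new, membuf_append, membuf_put, membuf_free, Putfcn) are hypothetical and do not describe the interface actually used by the Java generator. The put step takes a function pointer with the same parameter shape as nc_put_vara, so the sketch compiles without netcdf.h.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical dynamically extendible memory buffer. */
typedef struct Membuf {
    unsigned char* data;
    size_t length;   /* bytes in use */
    size_t alloc;    /* bytes allocated */
} Membuf;

/* Same parameter shape as nc_put_vara, so the real function
   could be passed in by a caller that includes netcdf.h. */
typedef int (*Putfcn)(int ncid, int varid, const size_t* start,
                      const size_t* count, const void* data);

/* 1. Create an empty buffer. */
Membuf*
membuf_new(void)
{
    return (Membuf*)calloc(1, sizeof(Membuf));
}

/* 2. Append nvalues instances of a primitive type of size typesize. */
int
membuf_append(Membuf* mb, const void* values, size_t typesize, size_t nvalues)
{
    size_t nbytes = typesize * nvalues;
    assert(mb != NULL && values != NULL);
    if(nbytes == 0) return 0;
    if(mb->length + nbytes > mb->alloc) {
        size_t newalloc = (mb->alloc + nbytes) * 2;
        unsigned char* newdata = (unsigned char*)realloc(mb->data, newalloc);
        if(newdata == NULL) return -1;
        mb->data = newdata;
        mb->alloc = newalloc;
    }
    memcpy(mb->data + mb->length, values, nbytes);
    mb->length += nbytes;
    return 0;
}

/* 3. Hand the accumulated memory to the put function. */
int
membuf_put(Membuf* mb, Putfcn put, int ncid, int varid,
           const size_t* start, const size_t* count)
{
    assert(mb != NULL && put != NULL);
    return put(ncid, varid, start, count, mb->data);
}

/* 4. Reclaim the buffer. */
void
membuf_free(Membuf* mb)
{
    if(mb != NULL) {
        free(mb->data);
        free(mb);
    }
}
```

This sketch appends values back to back. A real implementation would also need to insert whatever padding the compound type offsets require.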
| 
 | |
| <h3>Data Generation: Overview</h3>
 | |
| The way to think about data generation is to consider
 | |
| the following tree.
 | |
| <ul>
 | |
| <li>The root is a convenience and represents the whole
 | |
| set of variables specified in the CDL "data:" section.
 | |
| <li>The nodes in the tree just below the root represent
 | |
| the set of variables to which values are assigned in the
 | |
| data section.
 | |
| <li>The subtrees below each variable are the basetypes
 | |
| of each variable.  Thus if a variable x has a basetype
 | |
| that is a compound type, then the node below x will
 | |
| represent the whole compound type and the nodes below
 | |
| that compound type node will be the fields of the compound
 | |
| type, and so on.
 | |
| <li>The leaves of this tree are all of primitive type
 | |
| (e.g. NC_CHAR, NC_INT, NC_STRING).
 | |
| </ul>
 | |
| <p>
 | |
| The data generation code is divided into two
 | |
| primary groups.  One group handles all non-primitive variables
 | |
| and types. The other group handles all primitive variables
 | |
| and types (especially fields). The reason for this is that
 | |
| almost all languages can handle simple lists of primitive values.
 | |
| However, for non-primitive types, one of the methods from the previous
 | |
| section needs to be used.
 | |
| <p>
 | |
| Secondarily, the primitive handling code is divided into
 | |
| two groups.  One group handles the character type
 | |
| and the other group handles all other primitive types.
 | |
| The code for the first group is in chardata.c and is generally
 | |
| usable across all languages.
 | |
| <p>
 | |
| This split exists for historical reasons.
 | |
| It turns out that it is tricky to properly handle variables
 | |
| (or Compound type fields) of type NC_CHAR.
 | |
| Here the term "proper" means to mimic the output of
 | |
| the original ncgen program. To this end, a set of generically useful routines
 | |
| are defined in the chardata.c file.  These routines take a datasource
 | |
| and walk it to build a single string of characters, with appropriate fill,
 | |
| to correspond to a NC_CHAR typed variable or field.
 | |
| Unless your language has special
 | |
| requirements, it is probably best to always use these routines to process
 | |
| datalists for variables of type NC_CHAR.
 | |
| 
 | |
| <h3>Data Generation: Part I</h3>
 | |
| Data generation occurs in several places, but is roughly
 | |
| divided into two parts. First, the genc.c code will set up
 | |
| appropriate declarations to hold the data. Second, the code
 | |
| in cdata.c will generate the actual memory contents that must be
 | |
| passed to nc_put_vara.
 | |
| <p>
 | |
| As a rule, the genc.c code calls a limited set of
 | |
| entry points into cdata.c. Again as a rule,
 | |
| cdata.c does not call genc.c code except for the closure
 | |
| mechanism described below.
 | |
| <p>
 | |
| The critical pieces of code for part I are the procedures
 | |
| genc_defineattr() and genc_definevardata() in genc.c.
 | |
| 
 | |
| <h4>genc_definevardata</h4>
 | |
| This procedure is responsible for generating C constants corresponding
 | |
| to the data to be assigned to a variable as defined in the "data:" section
 | |
| of a CDL file. It is also responsible for
 | |
| generating the appropriate nc_put_vara_XXX code to actually assign
 | |
| the data to the variable.
 | |
| 
 | |
| <h4>genc_defineattr</h4>
 | |
| This procedure is responsible for generating C constants corresponding
 | |
| to the data to be assigned to an attribute
 | |
| from a CDL file. It is also responsible for
 | |
| generating the appropriate nc_put_att_XXX code to actually define
 | |
| the attribute.
 | |
| <p>
 | |
| As with variables, defining attributes of type NC_CHAR requires use
 | |
| of the gen_charXXX procedures.
 | |
| 
 | |
| <h3>Data Generation: Part II</h3>
 | |
| The procedures in cdata.c walk a datalist
 | |
| and generate a sequence of space-separated constants,
| possibly with nested paired braces ("{...}") as needed.
 | |
| The result is placed into a specified Bytebuffer.
 | |
| <p>
 | |
| As an aside, commas are added when needed to the list of constants
 | |
| using the <i>commify</i> procedure.
 | |
| <p>
 | |
| There are three primary procedures that are called from
| the genc.c code.
 | |
| <ul>
 | |
| <li>genc_attrdata --
 | |
| store (in its Bytebuffer argument) the sequence of constants
 | |
| corresponding to a given attribute datalist.
 | |
| <li>genc_scalardata --
 | |
| store the single constant (which may be of a user-defined type)
 | |
| corresponding to its variable's datalist.
 | |
| <li>genc_arraydata --
| store the vector of constants corresponding to its variable's datalist.
 | |
| This is by far the most complicated of the three procedures.
 | |
| </ul>
 | |
| <p>
 | |
| Internally, each of these three procedures invokes
 | |
| the <i>genc_data</i> procedure to process part of a datalist.
 | |
| 
 | |
| 
 | |
| <h3>Closures and VLEN</h3>
 | |
| Closures and VLEN handling are two rather specialized mechanisms.
 | |
| 
 | |
| <h4>Closures</h4>
 | |
| The data generation code uses a concept of closure or callback
 | |
| to allow the datalist processing to periodically
 | |
| call external code to do the actual C code generation.
 | |
| The reason for this is that it significantly improves
 | |
| performance if the generated datalist is periodically
 | |
| dumped to the netcdf .nc file using <i>nc_put_vara</i>.
 | |
| Note that the closure mechanism is only used for generating
 | |
| variable data; attributes cannot use this mechanism
 | |
| since they are defined all at once.
 | |
| <p>
 | |
| Basically, each call to the callback will generate
 | |
| C code for some C constants and calls to nc_put_vara().
 | |
| The closure data structure (struct Putvar) is defined as follows.
 | |
| <pre>
 | |
| typedef struct Putvar {
 | |
|     int (*putvar)(struct Putvar*, Odometer*, Bytebuffer*);
 | |
|     int rank;
 | |
|     Bytebuffer* code;
 | |
|     size_t startset[NC_MAX_VAR_DIMS];
 | |
|     struct CDF {
 | |
|         int grpid;
 | |
|         int varid;
 | |
|     } cdf;
 | |
|     struct C {
 | |
|         Symbol* var;
 | |
|     } c;
 | |
| } Putvar;
 | |
| </pre>
 | |
| An instance of the closure is created for
 | |
| each variable that is the target of nc_put_vara().
 | |
| It is initialized with the variable's symbol, rank, group id and variable
 | |
| id. It is also provided with a Bytebuffer into which it is supposed
 | |
| to store the generated C code.
 | |
| The startset is the cached previous set of dimension indices used
 | |
| for generating the nc_put_vara (see below).
 | |
| <p>
 | |
| The callback procedure (field "putvar")
| for generating C code is assigned to the procedure called cputvara()
| (defined in genc.c).
 | |
| This procedure takes as arguments the closure object,
 | |
| an <a href="#odometer">odometer</a> describing the current set of dimension indices,
 | |
| and a Bytebuffer containing the generated C constants
 | |
| to be assigned to this slice of the variable.
 | |
| <p>
 | |
| Every time the closure procedure is called, it generates a C variable
 | |
| to hold the generated C constant. It also generates
 | |
| C constants to hold the start and count vectors required
 | |
| by <i>nc_put_vara</i>. It then generates an <i>nc_put_vara()</i> call.
 | |
| The start vector argument for the nc_put_vara is defined by the startset
 | |
| field of the closure. The count vector argument to nc_put_vara
 | |
| is computed from the current cached 
 | |
| start vector and from the indices in the odometer.
 | |
| After the nc_put_vara() is generated, the odometer vector
 | |
| is assigned to the startset field in the closure for use on the next call.
 | |
| <p>
 | |
| There are some important assumptions about the state of the odometer
 | |
| when it is called.
 | |
| <ol>
 | |
| <li>The zeroth index controls the count set.
 | |
| <li>All other indices are assumed to be at their max values.
 | |
| </ol>
 | |
| <p>
 | |
| In particular, this means that the start vector is zero
| for all positions except position zero. The count vector,
| for all positions except zero, is the index in the odometer,
| which is assumed to be the max.
 | |
| <p>
 | |
| For start position zero, the position is taken from the last
 | |
| saved startset. The count position zero is the difference between
 | |
| that last start position and the current odometer zeroth index.
 | |
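<p>
The start and count rules above can be written out as a small function. This is one reading of the description, not the code in cputvara(): the name slicevectors and the constant MAXRANK are hypothetical, and the sketch assumes that "max" means the odometer holds the full dimension size at every position other than zero.

```c
#include <assert.h>
#include <stddef.h>

#define MAXRANK 8   /* hypothetical stand-in for NC_MAX_VAR_DIMS */

/* Hypothetical helper: compute the start and count vectors for one
   nc_put_vara call from the cached startset and the current odometer
   indices, then cache the odometer indices for the next call. */
void
slicevectors(int rank, size_t* startset, const size_t* odomindex,
             size_t* start, size_t* count)
{
    int i;
    assert(rank > 0 && rank <= MAXRANK);
    /* Position zero: resume where the previous slice stopped. */
    start[0] = startset[0];
    count[0] = odomindex[0] - startset[0];
    /* All other positions: cover the whole dimension. */
    for(i=1;i<rank;i++) {
        start[i] = 0;
        count[i] = odomindex[i];   /* assumed to be at its max */
    }
    /* Save the odometer indices for use on the next call. */
    for(i=0;i<rank;i++) startset[i] = odomindex[i];
}
```

For a variable declared with dimensions [6][3], a first call with odometer indices {2,3} yields start {0,0} and count {2,3}. A second call with {6,3} yields start {2,0} and count {4,3}.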
| 
 | |
| <h4>VLEN Constants</h4>
 | |
| VLEN constants need to be constructed
 | |
| as separate C data constants because
 | |
| the C compiler will never convert nested
 | |
| groups ({...}) to separate memory chunks.
 | |
| Thus, ncgen must in several places
 | |
| generate the VLEN constants as separate variables
 | |
| and then insert pointers to them in the appropriate
 | |
| places in the later datalist C constants.
 | |
| Note that this process can be very tricky
 | |
| for non-C languages (see genj.c and jdata.c for one approach).
 | |
| <p>
 | |
| As an optimization, ncgen tracks which datatypes
 | |
| will require use of vlen constants.
 | |
| This is any type whose definition is a vlen or whose
 | |
| basetype contains a vlen type.
 | |
| <p>
 | |
| The vlen generation process is two-fold.
 | |
| First, in the procedure processdatalist1() in semantics.c,
 | |
| the location of the struct Datalist objects
 | |
| that correspond to vlen constants is stored in a list called vlenconstants.
 | |
| When detected, each such Datalist object is tagged with
 | |
| a unique identifier and the vlen length (count).
 | |
| These will be used later to generate references to the vlen constant.
 | |
| These counts are only accurate for non-char typed variables;
 | |
| special handling is in place to handle character vlen constants.
 | |
| <p>
 | |
| The second vlen constant processing action is in the
 | |
| procedure genc_vlenconstant() in cdata.c. First, it walks the
 | |
| vlenconstants list and generates C code for C variables to
 | |
| define the vlen constant and C code to assign the vlen
 | |
| constant's data to that C variable.
 | |
| <p>
 | |
| When, later, the genc_datalist procedure encounters
| a Datalist tagged as representing a vlen constant, it can generate
| a nc_vlen_t constant as {&lt;count&gt;,&lt;vlenconstantname&gt;}
| and use it directly in the generated C datalist constant.
 | |
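<p>
To make the two steps concrete, the following sketch shows roughly what the generated C could look like for a variable holding two vlen values, {1,2,3} and {4,5}. The variable names are invented for the example and are not the names ncgen generates. The Vlen struct is a stand-in with the same two fields as netCDF's nc_vlen_t so that the sketch compiles without netcdf.h.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in with the same two fields as nc_vlen_t (a length and a pointer). */
typedef struct Vlen {
    size_t len;
    void*  p;
} Vlen;

/* Step 1: each vlen constant becomes its own C variable
   (hypothetical names). */
static int vlen_1[] = {1, 2, 3};
static int vlen_2[] = {4, 5};

/* Step 2: the datalist constant refers to them as {count, name} pairs. */
static Vlen v_data[2] = {
    {3, (void*)vlen_1},
    {2, (void*)vlen_2}
};

/* Helper for inspecting the result: element i of an int-based vlen. */
int
vlen_int(const Vlen* v, size_t i)
{
    assert(v != NULL && i < v->len);
    return ((const int*)v->p)[i];
}
```

Writing the nested lists directly inside the initializer of v_data would not work, which is why the separate variables are needed.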
| 
 | |
| 
 | |
| <h2>Utility Data Structures</h2>
 | |
| 
 | |
| <h3>Pool Memory Allocation</h3>
 | |
| As an approximation to garbage collection,
 | |
| this code uses a pool allocation mechanism.
 | |
| The goal is to allow dynamic construction of strings
 | |
| that have very short life-times; typically they are used
 | |
| to construct strings to send to the output file.
 | |
| <p>
 | |
| The pool mechanism wraps malloc and records the malloc'd
 | |
| memory in a circular buffer. When the buffer reaches its maximum
 | |
| size, previously allocated pool buffers are free'd.
 | |
| This is good in that the user does not have to litter
 | |
| code with free() statements. It is bad in that the pool
 | |
| allocated memory can be free'd too early if the memory
 | |
| does not have a short enough life.
 | |
| If you suspect the latter, then bump the size of the circular buffer
 | |
| and see if the problem goes away. If so, then your code
 | |
| is probably holding on to a pool buffer too long and should use
 | |
| regular malloc/free.
 | |
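<p>
A minimal sketch of the idea follows. It is not the pool code used by ncgen: the names poolalloc and POOLMAX are hypothetical, and the circular buffer is kept deliberately tiny so that the early-free hazard is easy to see.

```c
#include <assert.h>
#include <stdlib.h>

#define POOLMAX 4   /* hypothetical size of the circular buffer */

static void* pool[POOLMAX];   /* most recent allocations */
static int poolindex = 0;     /* next slot to reuse */

/* Allocate n zeroed bytes.  The caller never frees the result; the
   block is freed by the POOLMAX-th later call to poolalloc, when its
   slot in the circular buffer is reused. */
void*
poolalloc(size_t n)
{
    void* mem;
    assert(n > 0);
    free(pool[poolindex]);    /* release the oldest block; free(NULL) is a no-op */
    mem = calloc(1, n);
    pool[poolindex] = mem;
    poolindex = (poolindex + 1) % POOLMAX;
    return mem;
}
```

With POOLMAX set to 4, a block survives three further allocations and is freed by the fourth. Code that still holds the pointer at that moment is holding the memory too long.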
| <p>
 | |
| In the end, I am not sure if this is a good idea, but
 | |
| it does make the code simpler.
 | |
| 
 | |
| <h3><a name="List">List</a> and <a name="Bytebuffer">Bytebuffer</a></h3>
 | |
| The two datatypes List and Bytebuffer are used throughout the
| code.  They correspond closely in semantics to the Java ArrayList
| and StringBuffer types, respectively.  They are used to help
 | |
| encapsulate dynamically growing lists of objects or bytes
 | |
| to reduce certain kinds of memory allocation errors.
 | |
| <p>
 | |
| The canonical code for non-destructive walking of a List<T>
 | |
| is as follows.
 | |
| <pre>
 | |
| for(i=0;i<listlength(list);i++) {
 | |
|     T* element = (T*)listget(list,i);
 | |
|     ...
 | |
| }
 | |
| </pre>
 | |
| <p>
 | |
| Bytebuffer provides two ways to access its internal buffer of characters.
 | |
| One is "bbContents()", which returns a direct pointer to the buffer,
 | |
| and the other is "bbDup()", which returns a malloc'd string containing
 | |
| the contents and is guaranteed to be null terminated.
 | |
| 
 | |
| <h3><a name="odometer">Odometer: Multi-Dimensional Array Handling</a></h3>
 | |
| The odometer data type is used to convert 
 | |
| multiple dimensions into a single integer.
 | |
| The rule for converting a multi-dimensional
 | |
| array to a single dimension is as follows.
 | |
| <p>
 | |
| Suppose we have the declaration <code>int F[2][5][3];</code>.
 | |
| There are obviously a total of 2 X 5 X 3 = 30 integers in F.
 | |
| Thus, these three dimensions will be reduced to a single dimension of size 30.
 | |
| <p>
 | |
| A particular point in the three dimensions, say [x][y][z], is reduced to
 | |
| a number in the range 0..29 by computing <code>((x*5)+y)*3+z</code>.
 | |
| The corresponding general C code is as follows.
 | |
| <pre>
 | |
| size_t
 | |
| dimmap(int rank, size_t* indices, size_t* sizes)
 | |
| {
 | |
|     int i;
 | |
|     size_t count = 0;
 | |
|     for(i=0;i<rank;i++) {
 | |
| 	if(i > 0) count *= sizes[i];
 | |
| 	count += indices[i];
 | |
|     }
 | |
|     return count;
 | |
| }
 | |
| </pre>
 | |
| In this code, the indices variable corresponds to the x,y, and z.
 | |
| The sizes variable corresponds to the 2,5, and 3.
 | |
| <p>
 | |
| The Odometer type stores a set of dimensions
 | |
| and supports operations to iterate over all possible
 | |
| dimension combinations.
 | |
| The Odometer representation consists of the types Odometer and Dimdata.
 | |
| <pre>
 | |
| typedef struct Dimdata {
 | |
|     unsigned long datasize; // actual size of the datalist item
 | |
|     unsigned long index;    // 0 <= index < datasize
 | |
|     unsigned long declsize;
 | |
| } Dimdata;
 | |
| 
 | |
| typedef struct Odometer {
 | |
|     int     rank;
 | |
|     Dimdata dims[NC_MAX_VAR_DIMS];
 | |
| } Odometer;
 | |
| </pre>
 | |
| The following primary operations are defined.
 | |
| <ul>
 | |
| <li>Odometer* newodometer(Dimset*) - create an odometer from a set of Dimsets.
 | |
| <li>void freeodometer(Odometer*) - release the memory of an odometer.
 | |
| <li>int odometermore(Odometer* odom) - return 1 if there are more combinations
 | |
| of dimension values.
 | |
| <li>int odometerincr(Odometer* odo,int) - move to the next combination
 | |
| of dimension values.
 | |
| <li>unsigned long odometercount(Odometer* odo) -
 | |
| apply the above algorithm to convert the current odometer combination
 | |
| into a single integer.
 | |
| </ul>
 | |
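<p>
The way these operations fit together can be shown with a simplified odometer. The Odom type and the odom_ functions below are hypothetical and much smaller than the real Odometer (no Dimset, no datasize versus declsize distinction). They only illustrate the iterate-and-count pattern, with the rightmost index varying fastest.

```c
#include <assert.h>
#include <stddef.h>

#define MAXRANK 8   /* hypothetical stand-in for NC_MAX_VAR_DIMS */

/* Simplified, hypothetical odometer: one index and one size per dimension. */
typedef struct Odom {
    int    rank;
    size_t index[MAXRANK];
    size_t size[MAXRANK];
} Odom;

/* Start at the all-zero combination. */
void
odom_init(Odom* o, int rank, const size_t* sizes)
{
    int i;
    assert(o != NULL && rank > 0 && rank <= MAXRANK);
    o->rank = rank;
    for(i=0;i<rank;i++) {
        o->index[i] = 0;
        o->size[i] = sizes[i];
    }
}

/* Return 1 while the current combination is valid. */
int
odom_more(const Odom* o)
{
    return o->index[0] < o->size[0];
}

/* Move to the next combination, carrying from right to left. */
void
odom_incr(Odom* o)
{
    int i;
    for(i=o->rank-1;i>=0;i--) {
        o->index[i]++;
        if(o->index[i] < o->size[i]) return;
        if(i == 0) return;   /* leave index[0] == size[0] to signal the end */
        o->index[i] = 0;
    }
}

/* Convert the current combination into a single integer. */
size_t
odom_count(const Odom* o)
{
    int i;
    size_t count = 0;
    for(i=0;i<o->rank;i++) {
        if(i > 0) count *= o->size[i];
        count += o->index[i];
    }
    return count;
}
```

Iterating over sizes {2,5,3} with odom_more and odom_incr visits 30 combinations, and odom_count returns 0 through 29 in order.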
| 
 | |
| 
 | |
| 
 | |
| <h2>Misc. Notes</h2>
 | |
| <ul>
 | |
| <li> The flag "usingclassic" should be consulted when appropriate to determine
 | |
| if this CDL file should be treated as using only the netCDF classic model.
 | |
| </ul>
 | |
| 
 | |
| <h2><u>Change Log</u></h2>
 | |
| <ul>
 | |
| <li>07/04/2009 - First draft.
 | |
| </ul>
 | |
| 
 | |
| </body>
 | |
| </html>
 | |
| 
 | |
| <p>
 | |
| 
 | |
| 
 | |
| <i>genc_scalardata</i> or <i>genc_arraydata</i>.
 | |
| It stores in its Bytebuffer argument the sequence of constants
 | |
| corresponding to a given datalist. Handling commas is a tricky issue
 | |
| so you will find that many of the non-top-level routines in cdata.c
| take a pointer to a global state element, commap, that determines the
| current state of adding commas. The idea is that at the beginning of
| any (sub-)Datalist, we want to turn off the comma in front of the
| first generated constant and then add commas until we reach the end
 | |
| of that (sub-)Datalist.
 | |
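<p>
The comma rule can be illustrated with a small sketch. It is not the cdata.c code: the function names are hypothetical, and the comma state is kept in local variables passed by pointer rather than in a global state element. It prints a two-level list, turning the comma off at the start of every list and sub-list.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append text to the NUL-terminated string in out (capacity outsize). */
static void
appendtext(char* out, size_t outsize, const char* text)
{
    size_t used = strlen(out);
    assert(used + strlen(text) < outsize);
    memcpy(out + used, text, strlen(text) + 1);
}

/* Append one list item, preceded by a comma unless it is the first
   item of its list; *commap records whether a comma is now needed. */
static void
emititem(char* out, size_t outsize, const char* text, int* commap)
{
    if(*commap) appendtext(out, outsize, ", ");
    appendtext(out, outsize, text);
    *commap = 1;
}

/* Hypothetical example: write an nrows x ncols array as nested lists. */
void
emitrows(char* out, size_t outsize, const int* values, int nrows, int ncols)
{
    int r, c;
    int outercomma = 0;        /* start of the outer list: comma off */
    assert(out != NULL && outsize > 0);
    out[0] = '\0';
    for(r=0;r<nrows;r++) {
        int innercomma = 0;    /* start of each sub-list: comma off again */
        emititem(out, outsize, "{", &outercomma);
        for(c=0;c<ncols;c++) {
            char num[32];
            snprintf(num, sizeof(num), "%d", values[r*ncols + c]);
            emititem(out, outsize, num, &innercomma);
        }
        appendtext(out, outsize, "}");
    }
}
```

Keeping a separate comma state per nesting level is what prevents a comma from appearing directly after an opening brace.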
| 
 | |
| <h1><u><a name="GIT">General Internals Information</a></u></h1>
 | |
| 
 | |
| <h2><u>Primary NCGEN Data Structures</u></h2>
 | |
| There are two primary structures used in ncgen:
 | |
| <a href="#Symbol">struct Symbol</a> and
| <a href="#Datalist">struct Datalist</a>.
 | |
| 
 | |
| <h3><a name="Symbol">struct Symbol</a></h3>
 | |
| Symbol objects are linked into hierarchical structures
 | |
| to represent netcdf dimensions, types, groups, and variables.
 | |
| The struct has the following fields.
 | |
| <table>
 | |
| <tr><th colspan=3>struct Symbol Fields
 | |
| <tr valign=top><td>struct Symbol* next<td>-<td>
 | |
| The Symbol objects are all kept on a single linked list.
 | |
| No symbol is ever deleted until the end of the program.
 | |
| <tr valign=top><td>nc_class objectclass<td>-<td>
 | |
| This defines the general class of symbol, one of: NC_GRP, NC_DIM, NC_VAR, NC_ATT, or NC_TYPE.
 | |
| <tr valign=top><td>nc_class subclass<td>-<td>
 | |
| This defines the sub class of symbol, one of:
 | |
| NC_PRIM, NC_OPAQUE, NC_ENUM,
 | |
| NC_FIELD, NC_VLEN, NC_COMPOUND,
 | |
| NC_ECONST, NC_ARRAY, or NC_FILLVALUE.
 | |
| <tr valign=top><td>char* name<td>-<td>
 | |
| The symbol's name.
 | |
| <tr valign=top><td>struct Symbol* container<td>-<td>
 | |
| The symbol that is the container for this symbol.
 | |
| Typically, this is the group symbol that contains
 | |
| this symbol.
 | |
| <tr valign=top><td>struct Symbol* location<td>-<td>
 | |
| The current group that was open when this symbol was created.
 | |
| <tr valign=top><td>List* subnodes<td>-<td>
 | |
| The list of child symbols of this symbol.
 | |
| For example, a group symbol's dimensions,
 | |
| types, vars, and subgroups will be in this list.
 | |
| <tr valign=top><td>int is_prefixed<td>-<td>
 | |
| True if the name of this symbol contains a complete
 | |
| prefix path (e.g. /x/y/z).
 | |
| <tr valign=top><td>List* prefix<td>-<td>
 | |
| A list of the prefix names for this node.
 | |
| Note that if is_prefixed is false, then this
 | |
| list was constructed from the set of enclosing groups.
 | |
| <tr valign=top><td>struct Datalist* data<td>-<td>
 | |
| Stores the constants from attribute or datalist
 | |
| constructs.
 | |
| <tr valign=top><td>Typeinfo typ<td>-<td>
 | |
| Type information about this symbol
 | |
| as defined by the Typeinfo structure.
 | |
| <tr valign=top><td>Varinfo var<td>-<td>
 | |
| Variable information about a variable symbol
 | |
| as defined by the Varinfo structure.
 | |
| <tr valign=top><td>Attrinfo att<td>-<td>
 | |
| Attribute information about an attribute symbol
 | |
| as defined by the Attrinfo structure.
 | |
| <tr valign=top><td>Diminfo dim<td>-<td>
 | |
| Dimension information about a dimension symbol
 | |
| as defined by the Diminfo structure.
 | |
| <tr valign=top><td>Groupinfo grp<td>-<td>
 | |
| Group information about a group symbol
 | |
| as defined by the Groupinfo structure.
 | |
| <tr valign=top><td>int lineno<td>-<td>
 | |
| The source line in which this symbol was created.
 | |
| <tr valign=top><td>int touched<td>-<td>
 | |
| Used in transitive closure operations
 | |
| to prevent revisiting symbols.
 | |
| <tr valign=top><td>char* lname<td>-<td>
 | |
| Cached C or FORTRAN name (not used?).
 | |
| <tr valign=top><td>int ncid<td>-<td>
 | |
| The ncid/varid/dimid, etc when
 | |
| defining netcdf objects.
 | |
| </table>
 | |
| 
 | |
| <h4>struct Groupinfo</h4>
 | |
| Group symbols primarily keep the group
 | |
| containment structure in the subnodes field of the Symbol.
 | |
| <p>
 | |
| <table>
 | |
| <tr><th colspan=3>struct Groupinfo Fields
 | |
| <tr valign=top><td>int is_root<td>-<td>
 | |
| Is this the root group?
 | |
| </table>
 | |
| 
 | |
| <h4>struct Diminfo</h4>
 | |
| The only important information about a dimension,
 | |
| aside from name, is the dimension size.
 | |
| Additionally, type definitions may have anonymous
 | |
| (unnamed) dimensions.
 | |
| <p>
 | |
| <table>
 | |
| <tr><th colspan=3>struct Diminfo Fields
 | |
| <tr valign=top><td>int isconstant<td>-<td>
 | |
| Is this an anonymous dimension?
 | |
| <tr valign=top><td>unsigned int size<td>-<td>
 | |
| The size of the dimension.
 | |
| </table>
 | |
| 
 | |
| <h4>struct Varinfo</h4>
 | |
| Variables require two primary pieces of information:
 | |
| the set of attributes (including special attributes)
 | |
| and dimension information. The dimension information
 | |
| is kept in the Typeinfo structure because things
 | |
| other than variables have dimensions (e.g. user defined types).
 | |
| <p>
 | |
| <table>
 | |
| <tr><th colspan=3>struct Varinfo Fields
 | |
| <tr valign=top><td>int nattributes<td>-<td>
 | |
| The number of attributes; this is redundant but useful.
 | |
| <tr valign=top><td>List* attributes<td>-<td>
 | |
| The list of all attribute symbols associated with this
 | |
| variable.
 | |
| <tr valign=top><td>Specialdata special<td>-<td>
 | |
| Special attribute values.
 | |
| </table>
 | |
| 
 | |
| <h4>struct Typeinfo</h4>
 | |
| The type information is probably the second most
 | |
| used structure in all of the code (second to Symbol itself).
 | |
| <p>
 | |
| <table>
 | |
| <tr><th colspan=3>struct Typeinfo Fields
 | |
| <tr valign=top><td>struct Symbol* basetype<td>-<td>
 | |
| Provide a reference to the base type of this symbol.
 | |
| This applies to other types, variables, and attributes.
 | |
| <tr valign=top><td>int hasvlen<td>-<td>
 | |
| Does the type have a vlen definition anywhere within it?
 | |
| This is used as an optimization to avoid searching datalists
 | |
| for vlen constants.
 | |
| <tr valign=top><td>nc_type typecode<td>-<td>
 | |
| The typecode of the basetype.  This is most useful
 | |
| when the basetype is a primitive type.
 | |
| <tr valign=top><td>unsigned long size<td>-<td>
 | |
| The size of this object.
 | |
| <tr valign=top><td>unsigned long offset<td>-<td>
 | |
| The field offset for fields in compound types.
 | |
| <tr valign=top><td>unsigned long alignment<td>-<td>
 | |
| The memory alignment (i.e. 1,2,4,or 8).
 | |
| <tr valign=top><td>Constant econst<td>-<td>
 | |
| For enumeration constants, the actual value of the constant.
 | |
| <tr valign=top><td>Dimset dimset<td>-<td>
 | |
| The dimension information for the type or variable.
 | |
| The dimset stores the number of dimensions and a list
 | |
| of pointers to the corresponding dimension symbols.
 | |
| </table>
 | |
| 
 | |
| <h4>struct Attrinfo</h4>
 | |
| Note that the actual attribute data is stored
 | |
| in the data field of the containing Symbol.
 | |
| <p>
 | |
| <table>
 | |
| <tr><th colspan=3>struct Attrinfo Fields
 | |
| <tr valign=top><td>struct Symbol* var<td>-<td>
 | |
| The variable with which this attribute is associated;
 | |
| it is NULL for global attributes.
 | |
| <tr valign=top><td>unsigned long count<td>-<td>
 | |
| The number of instances associated with the attribute value.
 | |
| </table>
 | |
| 
 | |
| <h3><a name="Datalist">Datalists and Datasrcs</a></h3>
 | |
| Whenever a datalist is encountered during parsing, it is converted
 | |
| to an instance of struct Datalist.
 | |
| Each datalist instance contains a vector of instances of
 | |
| struct Constant that contains the actual data.
 | |
| <p>
 | |
| Each datalist instance contains the following information.
 | |
| <table>
 | |
| <tr><th colspan=3>struct Datalist Fields
 | |
| <tr valign=top><td>struct Datalist* next<td>-<td>
 | |
| All datalists are chained for reclamation.
 | |
| <tr valign=top><td>int readonly<td>-<td>
 | |
| Can this datalist be modified?
 | |
| <tr valign=top><td>unsigned int length<td>-<td>
 | |
| The number of Constant instances in the data field.
 | |
| <tr valign=top><td>unsigned int alloc<td>-<td>
 | |
| The memory space allocated to the data field.
 | |
| <tr valign=top><td>Constant* data<td>-<td>
 | |
| The vector in sequential memory of the constants comprising this datalist.
 | |
| <tr valign=top><td>struct Symbol* schema<td>-<td>
 | |
| The symbol (type, variable, or attribute) defining the structure of this datalist,
 | |
| if known.
 | |
| <tr valign=top><td>struct Vlen {<td>-<td>
 | |
| Information about the vlen instances contained in this datalist. 
 | |
| <tr><td>unsigned int count;
 | |
| <tr><td>unsigned int uid;
 | |
| <tr><td>} vlen
 | |
| <tr valign=top><td>Odometer* dimdata<td>-<td>
 | |
| A tracker to count through dimensions associated with this datalist via the schema.
 | |
| </table>
 | |
| <p>
 | |
| In turn, a Constant instance is defined as follows.
 | |
| <pre>
 | |
| typedef struct Constant {
 | |
|     nc_type 	  nctype;
 | |
|     int		  lineno;
 | |
|     Constvalue    value;
 | |
| } Constant;
 | |
| </pre>
 | |
| It indicates the type of the value and the source line number (if known)
 | |
| in which this constant was created.
 | |
| <p>
 | |
| The Constvalue type is a union
 | |
| of all possible values that can occur
 | |
| in a datalist.
 | |
| <pre>
 | |
| typedef union Constvalue {
 | |
|     struct Datalist* compoundv; // NC_COMPOUND
 | |
|     char charv;                 // NC_CHAR
 | |
|     signed char int8v;          // NC_BYTE
 | |
|     unsigned char uint8v;       // NC_UBYTE
 | |
|     short int16v;               // NC_SHORT
 | |
|     unsigned short uint16v;     // NC_USHORT
 | |
|     int int32v;                 // NC_INT
 | |
|     unsigned int uint32v;       // NC_UINT
 | |
|     long long int64v;           // NC_INT64
 | |
|     unsigned long long uint64v; // NC_UINT64
 | |
|     float floatv;               // NC_FLOAT
 | |
|     double doublev;             // NC_DOUBLE
 | |
|     struct Stringv {            // NC_STRING
 | |
|         int len;
 | |
|         char* stringv;
 | |
|     } stringv;
 | |
|     struct Opaquev {     // NC_OPAQUE
 | |
|         int len; // length as originally written (rounded to even number)
 | |
|         char* stringv; //as  constant was written
 | |
|                       // (padded to even # chars >= 16)
 | |
|                       // without leading 0x
 | |
|     } opaquev;
 | |
|     struct Symbol* enumv;   // NC_ECONST
 | |
| } Constvalue;
 | |
| </pre>
 | |
| <p>
 | |
| Several fields are of particular interest:
 | |
| <table>
 | |
| <tr><th colspan=3>Selected Constvalue Fields
 | |
| <tr valign=top><td>struct Datalist* compoundv<td>-<td>
 | |
| This stores nested datalists - typically
 | |
| of the form "{...{...}...}".
 | |
| <tr valign=top><td>struct Stringv {int len; char* stringv;} stringv<td>-<td>
 | |
| Store string constants.
 | |
| <tr valign=top><td>struct Opaquev {int len; char* stringv;} opaquev<td>-<td>
 | |
| Store opaque constants as written (i.e. abc...),
 | |
| without the leading 0x, and
 | |
| padded to an even number of characters to be
 | |
| at least 16 characters long.
 | |
| <tr valign=top><td>struct Symbol* enumv<td>-<td>
 | |
| Pointer to an enumeration constant definition.
 | |
| </table>
 | |
| 
 | |
| <h4>struct Datasrc</h4>
 | |
| When it comes time to generate datalists for output,
 | |
| it is necessary to "walk" the datalist (including nested
 | |
| datalists).  The Datasrc structure is used to do this.
 | |
| Its definition is as follows.
 | |
| <pre>
 | |
| typedef struct Datasrc {
 | |
|     unsigned int index;     // 0..length-1
 | |
|     unsigned int length;
 | |
|     int          autopop;   // pop when at end
 | |
|     Constant*    data;      // duplicate pointer; so do not free.
 | |
|     struct Datasrc* stack;
 | |
| } Datasrc;
 | |
| </pre>
 | |
| The Datasrc tracks the "current" location in the sequence
 | |
| of Constants (taken from a Datalist). The index field indicates
 | |
| the current location. 
 | |
| In effect, Datasrc acts as the lexer, and the code
 | |
| that walks it is parsing the data sequence.
 | |
| The following operations are supported (see data.[ch]).
 | |
| <ul>
 | |
| <li>datalist2src - takes a Datalist and constructs a Datasrc.
 | |
| <li>srcpush - assumes the current constant is a nested Datalist
 | |
| and pushes into that Datalist.
 | |
| <li>srcpushlist - pushes into the passed Datalist argument.
 | |
| <li>srcpop - pops the current list and resumes the next list in the
 | |
| stack.
 | |
| <li>srcnext - return the value at the index
 | |
| and then advance the Datasrc index.
 | |
| If at the end of the current datalist, then return NULL;
 | |
| srcincr is an alias for srcnext.
 | |
| <li>srcmore - return 1 if not at the end of the current Datasrc.
 | |
| Pushed datalists are not considered.
 | |
| <li>srcline - return a usable line number associated with the current
 | |
| position of the Datasrc (that is why Constant instances have a line
 | |
| number).
 | |
| <li>srcpeek - return the value at the index but do not advance.
 | |
| If at the end of the current datalist, then return NULL; srcget is an alias
 | |
| for srcpeek.
 | |
| </ul>
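<p>
The peek/next/more operations listed above can be illustrated with a much simplified walker. This is a sketch under stated assumptions, not the code in data.[ch]: SketchConst and SketchSrc are hypothetical stand-ins for Constant and Datasrc, the nesting stack (srcpush/srcpop) is left out, and the sketch_ function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct Constant: a value and a line number. */
typedef struct SketchConst { int value; int lineno; } SketchConst;

/* Simplified stand-in for struct Datasrc (no nesting stack). */
typedef struct SketchSrc {
    unsigned int index;   /* 0..length-1 */
    unsigned int length;
    SketchConst* data;    /* borrowed pointer; not freed here */
} SketchSrc;

/* Like srcmore: 1 if not at the end of the current list, else 0. */
static int
sketch_more(const SketchSrc* src)
{
    assert(src != NULL);
    return src->index < src->length;
}

/* Like srcpeek: the current constant without advancing; NULL at the end. */
static SketchConst*
sketch_peek(const SketchSrc* src)
{
    return sketch_more(src) ? &src->data[src->index] : NULL;
}

/* Like srcnext: the current constant, then advance; NULL at the end. */
static SketchConst*
sketch_next(SketchSrc* src)
{
    SketchConst* c = sketch_peek(src);
    if(c != NULL) src->index++;
    return c;
}

/* Walk the remaining constants and sum their values, the way a
   generator would visit each constant in turn. */
static int
sketch_sum(SketchSrc* src)
{
    int total = 0;
    SketchConst* c;
    while((c = sketch_next(src)) != NULL) total += c->value;
    return total;
}
```

The walking loop in sketch_sum has the same shape as a parser consuming tokens: it asks for the next constant until the source reports the end by returning NULL.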
 | |
| 
 | |
| <h2><u>The CDL Parser</u></h2>
 | |
| 
 | |
| The CDL parser and associated lexer
 | |
| (primarily files "ncgen.y" and "ncgen.l")
 | |
| parse CDL files into various data structures
 | |
| for use by the remaining ncgen code.
 | |
| The data structures described above,
 | |
| (<a href="#Symbol">Symbol</a>, and
 | |
| <a href="#Datalist">Datalist</a>)
 | |
| are primarily generated by the parser.
 | |
| 
 | |
| <h3>Parse Cliches</h3>
 | |
| <h4>Node Stacking</h4>
 | |
| One of the issues that must be addressed by any bottom-up
 | |
| parser is handling the accumulation of sets of items (nodes,
 | |
| etc.).  The YACC/Bison parse stack cannot be used
 | |
| because the set of accumulated nodes is unbounded
 | |
| and the YACC stack mechanism is bounded (i.e. each rule
 | |
| has a bounded right hand side length).
 | |
| <p>
 | |
| The node stacking set of cliches is ubiquitous in the
 | |
| parser, so they must be understood to understand how the
 | |
| parser works.  The cliche here is shown in the handling of,
 | |
| for example, the varlist rule, which is defined as follows.
 | |
| <pre>
 | |
| varlist:   varspec
 | |
|              {$$=listlength(stack); listpush(stack,(elem_t)$1);}
 | |
|          | varlist ',' varspec
 | |
| 	     {$$=$1; listpush(stack,(elem_t)$3);}
 | |
|          ;
 | |
| </pre>
 | |
| The varlist rule collects variable name declarations (via the varspec rule).
 | |
| The idea is to use a separate stack named "stack", and to track
 | |
| the index into the stack of the start of collection of objects.
 | |
| The varlist value (in the YACC sense) is defined as an integer
 | |
| representing the size of the stack at the start of a list of variables.
 | |
| That is what this code does: <code>$$=listlength(stack)</code>.
 | |
| <p>
 | |
| At the point where the set of varspecs should be processed, the following code cliche
 | |
| is used.
 | |
| <pre>
 | |
| vardecl: typeref varlist
 | |
|            {...
 | |
|             stackbase=$2;
 | |
|             stacklen=listlength(stack);
 | |
|             for(i=stackbase;i<stacklen;i++) {
 | |
|               Symbol* sym = (Symbol*)listget(stack,i);
 | |
|               ...
 | |
|             }
 | |
| 	    listsetlength(stack,stackbase);// remove stack nodes
 | |
| 	   }
 | |
|            ...
 | |
| </pre>
 | |
| The start of the set of variable declaration symbols is extracted
 | |
| as the integer associated with right-side non-terminal $2, e.g.
 | |
| <code>stackbase=$2</code>.
 | |
| The current stack length is obtained from <code>stacklen=listlength(stack)</code>.
 | |
| Then the elements of the stack are extracted one by one using the above loop.
 | |
| Finally, the nodes on the stack are cleared by the code segment
 | |
| <code>listsetlength(stack,stackbase)</code>.
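<p>
Outside of a grammar file, the same stack-base idiom can be sketched in plain C. The SketchStack type and the sketch_push and sketch_consume functions are hypothetical stand-ins for the parser's List-based stack; they only illustrate remembering the base index, consuming everything above it, and truncating back to it.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX 64

/* Hypothetical stand-in for the parser's List-based "stack". */
typedef struct SketchStack {
    const char* elems[SKETCH_MAX];
    size_t length;
} SketchStack;

/* Push one collected node (here just a name). */
static void
sketch_push(SketchStack* s, const char* e)
{
    assert(s != NULL && s->length < SKETCH_MAX);
    s->elems[s->length++] = e;
}

/* Consume everything pushed since base (what the vardecl action does),
   then truncate the stack back to base. Returns the number consumed. */
static size_t
sketch_consume(SketchStack* s, size_t base)
{
    size_t i, n = 0;
    assert(s != NULL && base <= s->length);
    for(i=base;i<s->length;i++) {
        /* a real action would process s->elems[i] here */
        n++;
    }
    s->length = base; /* same effect as listsetlength(stack,stackbase) */
    return n;
}
```

Because only the entries above the saved base are removed, anything an enclosing rule had already pushed is left untouched.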
 | |
| 
 | |
| <h4><u>Semantic Processing</u></h4>
 | |
| Semantic processing takes the output of the parser
 | |
| and adds various pieces of semantic information.
 | |
| The semantic actions are as follows.
 | |
| <ol>
 | |
| <li> Procedure processtypes().
 | |
|      <ol>
 | |
|      <li>Do a topological sort of the types based on dependency
 | |
|          so that the least dependent are first in the typdefs list.
 | |
|      <li>Fill in type typecodes.
 | |
|      <li>Mark types that have a vlen.
 | |
|      </ol>
 | |
| <li> Procedure filltypecodes() - Fill in implied type codes.
 | |
| <li> Procedure processvars() - Fill in missing values.
 | |
| <li> Procedure processattributes() -
 | |
|      Process attributes to connect to corresponding variable.
 | |
| <li> Procedure processcompound() -
 | |
|      Process each compound type to compute its size.
 | |
| <li> Procedure processenums() -
 | |
|      Fix up enum constant values.
 | |
| <li> Procedure processdatalists() -
 | |
|      Fix up datalists.
 | |
| <li> Procedure checkconsistency() -
 | |
|      Check internal consistency.
 | |
| <li> Procedure validate() -
 | |
|      Do any needed additional semantic checks.
 | |
| </ol>
 | |
| 
 | |
| <h2><u>Generating C Code</u></h2>
 | |
| The source code for generating C code output (via the -c option)
 | |
| is of most interest because it is the pattern to be used
 | |
| for other languages and because, frankly, it is complex and ugly
 | |
| at the moment and so guidance is needed in understanding it.
 | |
| <p>
 | |
| The files genc.[ch] and cdata.c are the primary files for C code generation.
 | |
| The files data.[ch] are also important.
 | |
| 
 | |
| <h3><u>Output Routines</u></h3>
 | |
| The output routines are a bit of a mixed bag.
 | |
| It is important to know that code is not directly
 | |
| dumped to the output file; rather it is accumulated
 | |
| in a global Bytebuffer instance called "ccode".
 | |
| <p>
 | |
| The output routines are as follows.
 | |
| <ul>
 | |
| <li>flushcode(void) - flush the ccode buffer to the output file.
 | |
| <li>cprint(Bytebuffer* buf) - dump the contents 
 | |
| of buf to the ccode buffer.
 | |
| <li>cpartial(char* line) - dump the contents of line
 | |
| to the ccode buffer, but do not add a trailing newline.
 | |
| <li>cline(char* line) - dump the contents of line
 | |
| to the ccode buffer and add a trailing newline.
 | |
| <li>clined(int n, char* line) - dump the contents of line to the ccode
 | |
| buffer; prefix with n indentations (typically 4 blanks each)
 | |
| and suffix with a trailing newline.
 | |
| </ul>
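<p>
The behavior of these routines can be approximated with a short sketch. It is not the genc.c implementation: a fixed char array stands in for the "ccode" Bytebuffer, and the sketch_ names are hypothetical equivalents of cpartial, cline, clined and flushcode.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static char sketch_code[1024]; /* stand-in for the global "ccode" buffer */

/* Like cpartial: append text with no trailing newline. */
static void
sketch_partial(const char* text)
{
    size_t used = strlen(sketch_code);
    assert(text != NULL && used + strlen(text) < sizeof(sketch_code));
    strcat(sketch_code, text);
}

/* Like cline: append text followed by a newline. */
static void
sketch_line(const char* text)
{
    sketch_partial(text);
    sketch_partial("\n");
}

/* Like clined: n indentations of 4 blanks, the text, then a newline. */
static void
sketch_lined(int n, const char* text)
{
    while(n-- > 0) sketch_partial("    ");
    sketch_line(text);
}

/* Like flushcode: write the accumulated text to out and empty the buffer. */
static void
sketch_flush(FILE* out)
{
    assert(out != NULL);
    fputs(sketch_code, out);
    sketch_code[0] = '\0';
}
```

Accumulating first and flushing later lets the generator build a statement from several partial pieces before anything reaches the output file.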
 | |
| 
 | |
| <h3><u>gen_ncc</u></h3>
 | |
| The gen_ncc procedure is responsible for
 | |
| creating and dumping the generated C code.
 | |
| <p>
 | |
| It has at its disposal several global lists of Symbols.
 | |
| Note that the lists cross all groups.
 | |
| <ul>
 | |
| <li>dimdefs - the set of symbols defining dimensions.
 | |
| <li>vardefs - the set of symbols defining variables.
 | |
| <li>attdefs - the set of symbols defining non-global attributes.
 | |
| <li>gattdefs - the set of symbols defining global attributes.
 | |
| <li>grpdefs - the set of symbols defining groups.
 | |
| <li>typdefs - the set of symbols defining types; note that this list
 | |
| has been topologically sorted so that a given type depends only
 | |
| on types with lower indices in the list.
 | |
| </ul>
 | |
| <p>
 | |
| The superficial operation of gen_ncc is as follows; the details
 | |
| are provided later where the operation is complex.
 | |
| <ol>
 | |
| <li>Generate header code (e.g. "#include <stdio.h>").
 | |
| <li>Generate C type definitions corresponding to the
 | |
| CDL types.
 | |
| <li>Generate VLEN constants.
 | |
| <li>Generate chunking constants.
 | |
| <li>Generate initial part of the main() procedure.
 | |
| <li>Generate C variable definitions to hold the ncids
 | |
| for all created groups.
 | |
| <li>Generate C variable definitions to hold the typeids
 | |
| of all created types.
 | |
| <li>Generate C variables and constants that correspond to
 | |
| the CDL dimensions.
 | |
| <li>Generate C variable definitions to hold the varids
 | |
| of all created variables.
 | |
| <li>Generate C code to create the netCDF binary file.
 | |
| <li>Generate C code to create all the groups in the proper
 | |
| hierarchy.
 | |
| <li>Generate C code to create the type definitions.
 | |
| <li>Generate C code to create the dimension definitions.
 | |
| <li>Generate C code to create the variable definitions.
 | |
| <li>Generate C code to create the global attributes.
 | |
| <li>Generate C code to create the non-global attributes.
 | |
| <li>Generate C code to leave define mode.
 | |
| <li>Generate C code to assign variable datalists.
 | |
| </ol>
 | |
| <p>
 | |
| The following code generates C code for defining the groups.
 | |
| It is fairly canonical and can be seen repeated in variant form
 | |
| when defining dimensions, types, variables, and attributes.
 | |
| <p>
 | |
| This code is redundant but for consistency, the root group
 | |
| ncid is stored like all other group ncids.
 | |
| Note that nprintf is a macro wrapper around snprintf.
 | |
| <pre>
 | |
| nprintf(stmt,sizeof(stmt),"    %s = ncid;",groupncid(rootgroup));
 | |
| cline(stmt);
 | |
| </pre>
 | |
| <p>
 | |
| The loop walks all group symbols in preorder form
 | |
| and generates a C call to nc_def_grp
 | |
| using parameters taken from the group Symbol instance (gsym).
 | |
| The call to nc_def_grp is followed by a call to the
 | |
| check_err procedure to verify the operation's result code.
 | |
| <pre>
 | |
| for(igrp=0;igrp<listlength(grpdefs);igrp++) {
 | |
|     Symbol* gsym = (Symbol*)listget(grpdefs,igrp);
 | |
|     if(gsym == rootgroup) continue; // ignore root
 | |
|     if(gsym->container == NULL) PANIC("null container");
 | |
|     nprintf(stmt,sizeof(stmt),
 | |
|             "    stat = nc_def_grp(%s, \"%s\", &%s);",
 | |
|             groupncid(gsym->container),
 | |
|             gsym->name, groupncid(gsym));
 | |
|     cline(stmt); // print the def_grp call
 | |
|     clined(1,"check_err(stat,__LINE__,__FILE__);");
 | |
| }
 | |
| flushcode();
 | |
| </pre>
 | |
| <p>
 | |
| The code to generate dimensions, types, attributes, and variables
 | |
| is similar, although often more complex.
 | |
| <p>
 | |
| The code to generate C equivalents of CDL types is
 | |
| in the procedure definectype().
 | |
| Note that this code is not the code that invokes e.g. nc_def_vlen.
 | |
| The generated C types are used when generating datalists
 | |
| so that the standard C constant assignment mechanism will produce
 | |
| the correct memory values.
 | |
| <p>
 | |
| The genc_deftype procedure is the one that actually
 | |
| generates C code to define the netcdf types.
 | |
| The generated C code is designed to store the resulting
 | |
| typeid into the C variable defined earlier
 | |
| for holding that typeid.
 | |
| <p>
 | |
| Note that for compound types, the NC_COMPOUND_OFFSET
 | |
| macro is normally used to match netcdf offsets to
 | |
| the corresponding struct type generated in definectype.
 | |
| However, there is a flag, TESTALIGNMENT,
 | |
| that can be set to use a computed value for the offset.
 | |
| 
 | |
| <h3><u>C Constant Datalist Generation</u></h3>
 | |
| All attributes, and some variables, require the
 | |
| construction of a memory object containing data
 | |
| to be assigned to that attribute or variable.
 | |
| The code to do this is by far the most complicated
 | |
| in ncgen.
 | |
| The file cdata.c contains the procedure genc_datalist(),
 | |
| which does most of the heavy lifting.
 | |
| <p>
 | |
| For attributes, the general form generated is 
 | |
| <pre>
 | |
| T* attributevar = {...};
 | |
| </pre>
 | |
| Except for VLENs, the datalist is completely
 | |
| contained in the braces, with brace nesting as required.
 | |
| A generated pointer to the attributevar is included
 | |
| in the generated call to nc_put_att().
 | |
| <p>
 | |
| For variables, the general form generated is similar to attributes.
 | |
| <pre>
 | |
| T* varvar = {...};
 | |
| </pre>
 | |
| Again, VLENs are handled specially.
 | |
| Also, for performance purposes, the datalist
 | |
| is loaded in pieces using nc_put_vara(). This is required if
 | |
| there are UNLIMITED dimensions, but is used for all cases
 | |
| for uniformity.
 | |
| 
 | |
| <h4>Datalist Closures</h4>
 | |
| The code uses a concept of closure or callback
 | |
| to allow the datalist processing to periodically
 | |
| call external code to do the actual C code generation.
 | |
| Basically, each call to the callback will generate
 | |
| C code for constants and calls to nc_put_vara().
 | |
| The closure data structure (struct Putvar) is defined as follows.
 | |
| <pre>
 | |
| typedef struct Putvar {
 | |
|     int (*putvar)(struct Putvar*, Odometer*, Bytebuffer*);
 | |
|     int rank;
 | |
|     Bytebuffer* code;
 | |
|     size_t startset[NC_MAX_VAR_DIMS];
 | |
|     struct CDF {
 | |
|         int grpid;
 | |
|         int varid;
 | |
|     } cdf;
 | |
|     struct C {
 | |
|         Symbol* var;
 | |
|     } c;
 | |
| } Putvar;
 | |
| </pre>
 | |
| An instance of the closure is created for
 | |
| each variable that is the target of nc_put_vara().
 | |
| It is initialized with the variable's symbol, rank, group id and variable
 | |
| id. It is also provided with a Bytebuffer into which it is supposed
 | |
| to store the generated C code.
 | |
| The startset is the cached previous set of dimension indices used
 | |
| for generating the nc_put_vara (see below).
 | |
| <p>
 | |
| The callback procedure (field "putvar")
 | |
| for generating C code putvar is assigned to the procedure called cputvara()
 | |
| (defined in genc.c).
 | |
| This procedure takes as arguments the closure object,
 | |
| an odometer describing the current set of dimension indices,
 | |
| and a Bytebuffer containing the generated C constants
 | |
| to be assigned to this slice of the variable.
 | |
| <p>
 | |
| Every time the closure procedure is called, it generates a C variable
 | |
| to hold the generated C constant. It then generates an nc_put_vara()
 | |
| call. The start vector argument for the nc_put_vara is defined by the startset
 | |
| field of the closure. The count vector argument to nc_put_vara
 | |
| is computed from the current cached 
 | |
| start vector and from the indices in the odometer.
 | |
| After the nc_put_vara() is generated, the odometer vector
 | |
| is assigned to the startset field in the closure for use on the next call.
 | |
| <p>
 | |
| There are some important assumptions about the state of the odometer
 | |
| when it is called.
 | |
| <ol>
 | |
| <li>The zeroth index controls the count set.
 | |
| <li>All other indices are assumed to be at their max values.
 | |
| </ol>
 | |
| <p>
 | |
| In particular, this means that the start vector is zero
 | |
| for all positions except position zero. For every position
 | |
| except zero, the count vector holds the index in the odometer,
 | |
| which is assumed to be the max.
 | |
| <p>
 | |
| For start position zero, the position is taken from the last
 | |
| saved startset. The count position zero is the difference between
 | |
| that last start position and the current odometer zeroth index.
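<p>
The start/count rules above can be written down as a small function. This is a sketch of the rule as described here, not the cputvara() code; the name sketch_slice is hypothetical, and it assumes that the odometer index for each position other than zero already holds the full extent of that dimension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: derive nc_put_vara-style start and count vectors
   from the cached startset and the current odometer indices.
   Assumes odom[i] for i>0 already holds the full extent of dimension i. */
static void
sketch_slice(int rank, const size_t* startset, const size_t* odom,
             size_t* start, size_t* count)
{
    int i;
    assert(rank > 0 && odom[0] >= startset[0]);
    /* Position zero: resume where the previous call stopped. */
    start[0] = startset[0];
    count[0] = odom[0] - startset[0];
    /* All other positions: always the whole dimension. */
    for(i=1;i<rank;i++) {
        start[i] = 0;
        count[i] = odom[i];
    }
}
```

For a rank 3 variable with a saved start of 2 in position zero and odometer indices {5,4,3}, this gives start {2,0,0} and count {3,4,3}.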
 | |
| <p>
 | |
| If all of this sounds complex, it is, and if/when I have time
 | |
| I will rethink the whole process of datalist generation
 | |
| from beginning to end.
 | |
| 
 | |
| <h4>VLEN Constants</h4>
 | |
| VLEN constants need to be constructed
 | |
| as separate C data constants because
 | |
| the C compiler will never convert nested
 | |
| groups ({...}) to separate memory chunks.
 | |
| Thus, ncgen must in several places
 | |
| generate the VLEN constants as separate variables
 | |
| and then insert pointers to them in the appropriate
 | |
| places in the later datalist C constants.
 | |
| <p>
 | |
| As an optimization, ncgen tracks which datatypes
 | |
| will require use of vlen constants.
 | |
| This is any type whose definition is a vlen or whose
 | |
| basetype contains a vlen type.
 | |
| <p>
 | |
| The vlen generation process is two-fold.
 | |
| First, in the procedure processdatalist1() in semantics.c,
 | |
| the location of the struct Datalist objects
 | |
| that correspond to vlen constants is stored in a list called vlenconstants.
 | |
| When detected, each such struct Datalist object is tagged with
 | |
| a unique identifier and the vlen length (count).
 | |
| These will be used later to generate references to the vlen constant.
 | |
| <p>
 | |
| The second vlen constant processing action is in the
 | |
| procedure genc_vlenconstant() in cdata.c. First, it walks the
 | |
| vlenconstants list and generates C code for variables to
 | |
| define the vlen constant and C code to assign the vlen
 | |
| constant's data to that variable.
 | |
| <p>
 | |
| When, later, the genc_datalist procedure encounters
 | |
| a Datalist tagged as representing a vlen constant, it can generate
 | |
| a nc_vlen_t constant as {<count>,<vlenconstantname>}
 | |
| and use it directly in the generated C datalist constant.
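<p>
The shape of the generated code can be pictured with a hand-written sketch. The variable names vlen_1, vlen_2 and sketch_data are illustrative and are not the names ncgen generates; sketch_vlen_t is a stand-in with the same two fields as netcdf's nc_vlen_t (len and p), declared locally only so the example compiles without netcdf.h.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in with the same fields as netcdf's nc_vlen_t (len, p). */
typedef struct { size_t len; void* p; } sketch_vlen_t;

/* Step 1: each vlen constant becomes its own C variable, because a
   nested {...} initializer would not become a separate memory chunk. */
static int vlen_1[] = {1, 2, 3};
static int vlen_2[] = {4, 5};

/* Step 2: the enclosing datalist refers to them as {count, pointer}. */
static sketch_vlen_t sketch_data[] = {
    {3, vlen_1},
    {2, vlen_2}
};

/* Sum the int elements of one vlen instance, to show that the pointer
   and count together describe the separate chunk. */
static int
sketch_vlen_sum(const sketch_vlen_t* v)
{
    const int* p;
    size_t i;
    int total = 0;
    assert(v != NULL);
    p = (const int*)v->p;
    for(i=0;i<v->len;i++) total += p[i];
    return total;
}
```

The unique identifier and count recorded on each tagged Datalist are what make it possible to name the separate variable and fill in the {count, pointer} pair.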
 | |
| 
 | |
| <h4>Walking the Datalist</h4>
 | |
| To actually generate the C code for a datalist constant,
 | |
| the procedure genc_datalist wraps the Datalist in a Datasrc,
 | |
| and proceeds to walk it constant by constant, generating
 | |
| the corresponding C constant. The bulk of the work
 | |
| is performed in the recursive procedure genc_datalist1().
 | |
| <p>
 | |
| For better or worse, the code
 | |
| acts like a 1-lookahead parser. This means that it decides
 | |
| what to do based on the current type, the current constant and, when necessary,
 | |
| the next constant in the Datasrc. In practice, the lookahead
 | |
| is hidden, so it is not represented in the following table.
 | |
| <p>
 | |
| <table border=1>
 | |
| <tr><th>Current Type<th>Current Constant<th>action
 | |
| <tr valign=top><td>NC_PRIM<td>Primitive Constant<td>Generate the C constant; convert as necessary.
 | |
| <tr valign=top><td>NC_OPAQUE<td>''<td>''
 | |
| <tr valign=top><td>NC_ENUM<td>''<td>''
 | |
| <tr valign=top><td>NC_COMPOUND<td>Nested Datalist Constant<td>Push into the datalist and recurse on each field; When done, pop back to previous datalist.
 | |
| <tr valign=top><td>NC_COMPOUND<td>Any other Constant<td>
 | |
| Continue to recurse on each field; This allows
 | |
| specification of fields without enclosing in {...}.
 | |
| <tr valign=top><td>NC_VLEN<td>Nested Datalist Constant<td>Generate the
 | |
| nc_vlen_t instance using the tagged information in the struct Datalist.
 | |
| <tr valign=top><td>NC_FIELD<td>NA<td>If this field is dimensioned,
 | |
| then call genc_fielddata to walk the dimensions. Otherwise, just
 | |
| recurse on genc_datalist1.
 | |
| </table>
 | |
| <p>
 | |
| The genc_fielddata() procedure iterates over a field dimension
 | |
| and calls itself recursively to walk the remaining dimensions.
 | |
| If this is the last dimension, then it calls genc_datalist1 to
 | |
| generate C code for the basetype of the field.
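<p>
The recursion pattern used for field dimensions can be sketched as follows. This is not the genc_fielddata() code: sketch_walkdims is a hypothetical function that only counts the leaf elements it reaches, where the real procedure would generate C code for the field's basetype.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: visit every element of a dimensioned field.
   sizes[0..rank-1] are the dimension sizes; dim is the dimension
   currently being iterated. Returns how many leaf elements were visited. */
static size_t
sketch_walkdims(int rank, const size_t* sizes, int dim)
{
    size_t i, visited = 0;
    assert(rank > 0 && sizes != NULL && dim >= 0 && dim < rank);
    for(i=0;i<sizes[dim];i++) {
        if(dim == rank-1)
            visited += 1; /* last dimension: one base-type value goes here */
        else
            visited += sketch_walkdims(rank, sizes, dim+1);
    }
    return visited;
}
```

For a field with dimensions 2, 5 and 3, calling sketch_walkdims(3, sizes, 0) reaches 30 leaf elements, one per value that must be generated.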
 | |
| 
 | |
| <h4>String/Char Handling</h4>
 | |
| All through the genc_datalist code,
 | |
| there are special cases for handling string constants.
 | |
| The reason is, of course, that the string constant "abcd.."
 | |
| may, depending on the type context, be either a string
 | |
| or an array of characters.
 | |
| 
 | |
| <h4>Generating Variable Data</h4>
 | |
| The genc_datalist code does not call closures.
 | |
| The closures are used in the genc_vardata() and genc_vardata1()
 | |
| procedures; genc_vardata1 being the recursive procedure that actually
 | |
| calls the closure.
 | |
| <p>
 | |
| The genc_vardata1() procedure, like genc_fielddata,
 | |
| iterates over a top-level dimension and calls itself recursively
 | |
| to iterate over the remaining dimensions.
 | |
| The term "top-level" refers to the fact that these are the dimensions
 | |
| specified for a variable as opposed to field dimensions.
 | |
| <p>
 | |
| When iterating an UNLIMITED dimension, or when iterating the first
 | |
| dimension, the code generates a datalist for this subslice
 | |
| and then calls the closure to generate the C code.
 | |
| 
 | |
| 
 | |
| <h2><u>Miscellaneous</u></h2>
 | |
| <h4>Pool Memory Allocation</h4>
 | |
| As an approximation to garbage collection,
 | |
| this code uses a pool allocation mechanism.
 | |
| The goal is to allow dynamic construction of strings
 | |
| that have very short life-times; typically they are used
 | |
| to construct strings to send to the output file.
 | |
| <p>
 | |
| The pool mechanism wraps malloc and records the malloc'd
 | |
| memory in a circular buffer. When the buffer reaches its maximum
 | |
| size, previously allocated pool buffers are free'd.
 | |
| This is good in that the user does not have to litter
 | |
| code with free() statements. It is bad in that the pool
 | |
| allocated memory can be free'd too early if the memory
 | |
| does not have a short enough life.
 | |
| If you suspect the latter, then bump the size of the circular buffer
 | |
| and see if the problem goes away. If so, then your code
 | |
| is probably holding on to a pool buffer too long and should use
 | |
| regular malloc/free.
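<p>
A pool of this kind can be sketched in a few lines. This is an illustration of the idea rather than the ncgen implementation: the names sketch_pool and sketch_poolalloc are hypothetical, and the buffer size is deliberately tiny so that slot reuse is easy to see.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_POOLMAX 4 /* deliberately tiny for illustration */

static void* sketch_pool[SKETCH_POOLMAX]; /* circular record of allocations */
static int sketch_poolnext = 0;           /* slot to reuse next */

/* Hypothetical pool allocator: malloc, remember the pointer, and free
   whatever previously occupied the slot being reused. */
static void*
sketch_poolalloc(size_t size)
{
    void* mem = malloc(size);
    assert(mem != NULL);
    free(sketch_pool[sketch_poolnext]); /* free(NULL) is a no-op */
    sketch_pool[sketch_poolnext] = mem;
    sketch_poolnext = (sketch_poolnext + 1) % SKETCH_POOLMAX;
    return mem;
}
```

With a pool of 4 slots, the fifth allocation frees the first one, which is exactly the failure mode described above when a caller keeps a pool pointer for too long.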
 | |
| <p>
 | |
| In the end, I am not sure if this is a good idea, but
 | |
| it does make the code simpler.
 | |
| 
 | |
| <h4>List and Bytebuffer</h4>
 | |
| The two datatypes List and Bytebuffer are used throughout the
 | |
| code.  They correspond closely in semantics to the Java ArrayList
 | |
| and StringBuffer types, respectively.  They are used to help
 | |
| encapsulate dynamically growing lists of objects or bytes
 | |
| to reduce certain kinds of memory allocation errors.
 | |
| <p>
 | |
| The canonical code for non-destructive walking of a List&lt;T&gt;
 | |
| is as follows.
 | |
| <pre>
 | |
| for(i=0;i&lt;listlength(list);i++) {
 | |
|     T* element = (T*)listget(list,i);
 | |
|     ...
 | |
| }
 | |
| </pre>
 | |
| <p>
 | |
| Bytebuffer provides two ways to access its internal buffer of characters.
 | |
| One is "bbContents()", which returns a direct pointer to the buffer,
 | |
| and the other is "bbDup()", which returns a malloc'd string containing
 | |
| the contents and is guaranteed to be null terminated.
 | |
| 
 | |
| <h4><a name="odometer">Odometer: Multi-Dimensional Array Handling</a></h4>
 | |
| The odometer data type is used to convert 
 | |
| multiple dimensions into a single integer.
 | |
| The rule for converting a multi-dimensional
 | |
| array to a single dimension is as follows.
 | |
| <p>
 | |
| Suppose we have the declaration <code>int F[2][5][3];</code>.
 | |
| There are obviously a total of 2 X 5 X 3 = 30 integers in F.
 | |
| Thus, these three dimensions will be reduced to a single dimension of size 30.
 | |
| <p>
 | |
| A particular point in the three dimensions, say [x][y][z], is reduced to
 | |
| a number in the range 0..29 by computing <code>((x*5)+y)*3+z</code>.
 | |
| The corresponding general C code is as follows.
 | |
| <pre>
 | |
| size_t
 | |
| dimmap(int rank, size_t* indices, size_t* sizes)
 | |
| {
 | |
|     int i;
 | |
|     size_t count = 0;
 | |
|     for(i=0;i&lt;rank;i++) {
 | |
| 	if(i > 0) count *= sizes[i];
 | |
| 	count += indices[i];
 | |
|     }
 | |
|     return count;
 | |
| }
 | |
| </pre>
 | |
| In this code, the indices variable holds x, y, and z.
 | |
| The sizes variable holds the dimension sizes 2, 5, and 3.
 | |
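<p>
As a quick check of the formula, the following sketch repeats dimmap()
so that it compiles on its own and wraps it for the F[2][5][3] example.
The helper name f_offset is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Same algorithm as dimmap() above, repeated so this sketch stands alone. */
size_t
dimmap(int rank, const size_t* indices, const size_t* sizes)
{
    int i;
    size_t count = 0;
    assert(rank >= 0);
    for(i=0;i<rank;i++) {
        if(i > 0) count *= sizes[i];
        count += indices[i];
    }
    return count;
}

/* Hypothetical helper: linear offset of F[x][y][z] for int F[2][5][3]. */
size_t
f_offset(size_t x, size_t y, size_t z)
{
    const size_t sizes[3] = {2, 5, 3};
    size_t indices[3];
    indices[0] = x;
    indices[1] = y;
    indices[2] = z;
    return dimmap(3, indices, sizes);
}
```

For instance, f_offset(1,2,1) evaluates ((1*5)+2)*3+1 = 22, and the last
element F[1][4][2] maps to ((1*5)+4)*3+2 = 29.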
| <p>
 | |
| The Odometer type stores a set of dimensions
 | |
| and supports operations to iterate over all possible
 | |
| dimension combinations.
 | |
| The odometer is defined by the two types Dimdata and Odometer.
 | |
| <pre>
 | |
| typedef struct Dimdata {
 | |
|     unsigned long datasize; // actual size of the datalist item
 | |
|     unsigned long index;    // 0 <= index < datasize
 | |
|     unsigned long declsize;
 | |
| } Dimdata;
 | |
| 
 | |
| typedef struct Odometer {
 | |
|     int     rank;
 | |
|     Dimdata dims[NC_MAX_VAR_DIMS];
 | |
| } Odometer;
 | |
| </pre>
 | |
| The following primary operations are defined.
 | |
| <ul>
 | |
| <li>Odometer* newodometer(Dimset*) - create an odometer from a Dimset (a set of dimensions).
 | |
| <li>void freeodometer(Odometer*) - release the memory of an odometer.
 | |
| <li>int odometermore(Odometer* odom) - return 1 if there are more combinations
 | |
| of dimension values.
 | |
| <li>int odometerincr(Odometer* odo,int) - move to the next combination
 | |
| of dimension values.
 | |
| <li>unsigned long odometercount(Odometer* odo) -
 | |
| apply the above algorithm to convert the current odometer combination
 | |
| into a single integer.
 | |
| </ul>
 | |
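<p>
The way these operations fit together can be sketched with a simplified,
self-contained odometer. The MiniOdom type and the miniodom_* functions
are hypothetical stand-ins, not the ncgen definitions; the sketch assumes
that the last dimension varies fastest and omits the second argument of
odometerincr(), whose meaning is not described here.

```c
#include <assert.h>
#include <stddef.h>

#define MINI_MAX_RANK 8  /* hypothetical stand-in for NC_MAX_VAR_DIMS */

typedef struct MiniOdom {
    int    rank;
    size_t index[MINI_MAX_RANK]; /* current position in each dimension */
    size_t size[MINI_MAX_RANK];  /* extent of each dimension */
} MiniOdom;

void
miniodom_init(MiniOdom* odom, int rank, const size_t* sizes)
{
    int i;
    assert(rank > 0 && rank <= MINI_MAX_RANK);
    odom->rank = rank;
    for(i=0;i<rank;i++) {
        assert(sizes[i] > 0);
        odom->size[i] = sizes[i];
        odom->index[i] = 0;
    }
}

/* Return 1 while the odometer still points at a valid combination. */
int
miniodom_more(const MiniOdom* odom)
{
    return odom->index[0] < odom->size[0];
}

/* Step to the next combination; the last dimension varies fastest. */
void
miniodom_incr(MiniOdom* odom)
{
    int i;
    for(i=odom->rank-1;i>=0;i--) {
        odom->index[i]++;
        if(odom->index[i] < odom->size[i]) return;
        if(i == 0) return;  /* leave index[0] == size[0] to signal the end */
        odom->index[i] = 0; /* wrap and carry into the next dimension */
    }
}

/* Linearize the current position using the dimmap() rule. */
size_t
miniodom_count(const MiniOdom* odom)
{
    int i;
    size_t count = 0;
    for(i=0;i<odom->rank;i++) {
        if(i > 0) count *= odom->size[i];
        count += odom->index[i];
    }
    return count;
}

/* Walk every combination and check that the offsets come out as 0,1,2,...
   Returns the number of combinations visited, or 0 on a mismatch. */
size_t
miniodom_walk(int rank, const size_t* sizes)
{
    MiniOdom odom;
    size_t expected = 0;
    miniodom_init(&odom, rank, sizes);
    while(miniodom_more(&odom)) {
        if(miniodom_count(&odom) != expected) return 0;
        expected++;
        miniodom_incr(&odom);
    }
    return expected;
}
```

The loop in miniodom_walk() mirrors the intended usage pattern: test with
the "more" operation, read the linear offset with the "count" operation,
then advance with the "incr" operation.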
| 
 | |
| <h3><u>Change Log</u></h3>
 | |
| <ul>
 | |
| <li>04/15/2009 - Add major discussion about adding a new output language.
 | |
| <li>03/10/2009 - Fix typos.
 | |
| <li>03/07/2009 - First draft.
 | |
| </ul>
 | |
| 
 | |
| </body>
 | |
| </html>
 | |
| 
 | |
| 
 |