Problems existing in Zoltan.
This file was last updated on $Date$
-------------------------------------------------------------------------------
ERROR CONDITIONS IN ZOLTAN

When a processor returns from Zoltan to the application due to an error
condition, other processors do not necessarily return the same condition.
In fact, other processors may not know that the processor has quit Zoltan,
and may hang in a communication (waiting for a message that is not sent due
to the error condition).  The parallel error-handling capabilities of Zoltan
will be improved in future releases.
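Until the error handling improves, an application can reduce the risk of
such hangs by making all processes agree on Zoltan's return code before
continuing.  The sketch below is one possible application-side pattern, not
part of the Zoltan API; it assumes Zoltan's convention that ZOLTAN_OK is 0
and the fatal codes (ZOLTAN_FATAL, ZOLTAN_MEMERR) are negative, and that
all processes call Zoltan on the same communicator.

    #include <mpi.h>
    #include <zoltan.h>

    /* Make every process agree on the worst Zoltan return code seen, so
     * that no process proceeds into a communication that a failed process
     * will never join. */
    int agree_on_error(int my_rc, MPI_Comm comm)
    {
      int worst_rc;
      /* MPI_MIN selects the most severe code: ZOLTAN_MEMERR (-2) or
       * ZOLTAN_FATAL (-1) on any process beats ZOLTAN_OK (0) everywhere. */
      MPI_Allreduce(&my_rc, &worst_rc, 1, MPI_INT, MPI_MIN, comm);
      return worst_rc;
    }

For example, the return value of Zoltan_LB_Partition can be passed through
agree_on_error so that every process takes the same error path whenever the
agreed code is not ZOLTAN_OK.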
-------------------------------------------------------------------------------
RCB/RIB ON ASCI RED

On ASCI Red, the number of context IDs (e.g., MPI communicators) is limited
to 8192.  The environment variable MPI_REUSE_CONTEXT_IDS must be set to
reuse the IDs; setting this variable, however, slows performance.  An
alternative is to set the Zoltan parameter TFLOPS_SPECIAL to "1".  With
TFLOPS_SPECIAL set, communicators in RCB/RIB are not split and, thus, the
application is less likely to run out of context IDs.  However, ASCI Red
also has a bug that is exposed by TFLOPS_SPECIAL:  when messages sent with
MPI_Send/MPI_Recv within RCB/RIB exceed MPI_SHORT_MSG_SIZE, MPI_Recv hangs.
We do not expect these conditions to exist on future platforms and, indeed,
plan to make TFLOPS_SPECIAL obsolete in future versions of Zoltan rather
than re-work it with MPI_Irecv.   -- KDD 10/5/2004
-------------------------------------------------------------------------------
ERROR CONDITIONS IN OCTREE, PARMETIS AND JOSTLE

On failure, the OCTREE, ParMETIS and Jostle methods abort rather than
return error codes.
-------------------------------------------------------------------------------
ZOLTAN_INITIALIZE BUT NO ZOLTAN_FINALIZE

If Zoltan_Initialize calls MPI_Init, then MPI_Finalize will never be called,
because there is no Zoltan_Finalize routine.  If the application uses MPI
and calls MPI_Init and MPI_Finalize itself, there is no problem.
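The safe pattern is therefore for the application to own the MPI lifetime
itself, as in the minimal sketch below.  The sketch assumes only the
documented Zoltan_Initialize interface; everything else is application code.

    #include <mpi.h>
    #include <zoltan.h>

    int main(int argc, char **argv)
    {
      float version;

      /* The application initializes MPI itself, so Zoltan_Initialize
       * will not call MPI_Init on its own. */
      MPI_Init(&argc, &argv);

      /* Returns the Zoltan version number in `version'. */
      Zoltan_Initialize(argc, argv, &version);

      /* ... create Zoltan structures, partition, migrate, etc. ... */

      /* Because the application called MPI_Init, it can and must call
       * MPI_Finalize; Zoltan provides no Zoltan_Finalize to do so. */
      MPI_Finalize();
      return 0;
    }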
-------------------------------------------------------------------------------
HETEROGENEOUS ENVIRONMENTS

Some parts of Zoltan currently assume that basic data types such as
integers and real numbers (floats) have identical representations on all
processors.  This may not be true in a heterogeneous environment.
Specifically, the unstructured (irregular) communication library is unsafe
in a heterogeneous environment.  This problem will be corrected in a future
release of Zoltan for heterogeneous systems.
-------------------------------------------------------------------------------
F90 ISSUES

Pacific Sierra Research (PSR) Vastf90 is not currently supported due to
compiler bugs with no known workarounds.  It is not known when or if this
compiler will be supported.

N.A. Software FortranPlus is not currently supported due to problems with
the query functions.  We anticipate that this problem can be overcome, and
support will be added soon.
-------------------------------------------------------------------------------
PROBLEMS EXISTING IN PARMETIS
(Reported to the ParMETIS development team at the University of Minnesota,
metis@cs.umn.edu)

Name: Free-memory write in PartGeomKway
Version: ParMETIS 3.1.1
Symptom: free-memory write reported by Purify and Valgrind for graphs with
         no edges
Description: For input graphs with no (or, perhaps, few) edges, Purify and
         Valgrind report writes to already freed memory, as shown below.

  FMW: Free memory write:
  * This is occurring while in thread 22199:
        SetUp(void)               [setup.c:80]
        PartitionSmallGraph(void) [weird.c:39]
        ParMETIS_V3_PartGeomKway  [gkmetis.c:214]
        Zoltan_ParMetis           [parmetis_interface.c:280]
        Zoltan_LB                 [lb_balance.c:384]
        Zoltan_LB_Partition       [lb_balance.c:91]
        run_zoltan                [dr_loadbal.c:581]
        main                      [dr_main.c:386]
        __libc_start_main         [libc.so.6]
        _start                    [crt1.o]
  * Writing 4 bytes to 0xfcd298 in the heap.
  * Address 0xfcd298 is at the beginning of a freed block of 4 bytes.
  * This block was allocated from thread -1781075296:
        malloc                    [rtlib.o]
        GKmalloc(void)            [util.c:151]
        idxmalloc(void)           [util.c:100]
        AllocateWSpace            [memory.c:28]
        ParMETIS_V3_PartGeomKway  [gkmetis.c:123]
        Zoltan_ParMetis           [parmetis_interface.c:280]
        Zoltan_LB                 [lb_balance.c:384]
        Zoltan_LB_Partition       [lb_balance.c:91]
        run_zoltan                [dr_loadbal.c:581]
        main                      [dr_main.c:386]
        __libc_start_main         [libc.so.6]
        _start                    [crt1.o]
  * There have been 10 frees since this block was freed from thread 22199:
        GKfree(void)              [util.c:168]
        Mc_MoveGraph(void)        [move.c:92]
        ParMETIS_V3_PartGeomKway  [gkmetis.c:149]
        Zoltan_ParMetis           [parmetis_interface.c:280]
        Zoltan_LB                 [lb_balance.c:384]
        Zoltan_LB_Partition       [lb_balance.c:91]
        run_zoltan                [dr_loadbal.c:581]
        main                      [dr_main.c:386]
        __libc_start_main         [libc.so.6]
        _start                    [crt1.o]

Reported: 8/31/09; http://glaros.dtc.umn.edu/flyspray/task/50
Status: Reported 8/31/09.

Name: PartGeom limitation
Version: ParMETIS 3.0, 3.1
Symptom: incorrect number of partitions when # partitions != # processors
Description: The ParMETIS method PartGeom produces decompositions with
         exactly one partition per processor.  The Zoltan parameters
         NUM_GLOBAL_PARTITIONS and NUM_LOCAL_PARTITIONS are ignored.
Reported: Not yet reported.
Status: Not yet reported.

Name: vsize array freed in ParMetis
Version: ParMETIS 3.0 and 3.1
Symptom: seg. fault, core dump at run time
Description: When ParMETIS_V3_AdaptiveRepart is called with the vsize
         parameter, ParMetis tries to free the vsize array even if it was
         allocated in Zoltan.  Zoltan then tries to free vsize again later,
         resulting in a fatal error.  As a temporary fix, Zoltan never
         calls ParMetis with the vsize parameter.
Reported: 11/25/2003.
Status: Acknowledged by George Karypis.

Name: ParMETIS_V3_AdaptiveRepart and ParMETIS_V3_PartKWay crash for
      zero-sized partitions
Version: ParMETIS 3.1
Symptom: run-time error "killed by signal 8" on DEC; FPE, divide-by-zero
Description: Metis divides by partition size; thus, zero-sized partitions
         cause a floating-point exception.
Reported: 9/9/2003.
Status: ?

Name: ParMETIS_V3_AdaptiveRepart dies for zero-sized partitions
Version: ParMETIS 3.0
Symptom: run-time error "killed by signal 8" on DEC; FPE, divide-by-zero
Description: ParMETIS_V3_AdaptiveRepart divides by partition size; thus,
         zero-sized partitions cause a floating-point exception.  This
         problem is exhibited in the adaptive-partlocal3 tests.  The tests
         actually run on Sun and Linux machines (which do not trap the
         divide-by-zero), but raise an FPE signal on DEC (Compaq) machines.
Reported: 1/23/2003.
Status: Fixed in ParMetis 3.1, but a new problem appeared (see above).

Name: ParMETIS_V3_AdaptiveRepart crashes when no edges
Version: ParMETIS 3.0
Symptom: floating-point exception, divide-by-zero
Description: Divide-by-zero in ParMETISLib/adrivers.c, function
         Adaptive_Partition, line 40.
Reported: 1/23/2003.
Status: Fixed in ParMetis 3.1.

Name: Uninitialized memory read in akwayfm.c
Version: ParMETIS 3.0
Symptom: UMR warning
Description: UMR in ParMETISLib/akwayfm.c, function Moc_KWayAdaptiveRefine,
         near line 520.
Reported: 1/23/2003.
Status: Fixed in ParMetis 3.1.

Name: Memory leak in wave.c
Version: ParMETIS 3.0
Symptom: some memory not freed
Description: Memory leak in ParMETISLib/wave.c, function WavefrontDiffusion;
         memory for the variables solution, perm, workspace, and cand is
         not always freed.  We believe the early return near line 111
         causes the problem.
Reported: 1/23/2003.
Status: Fixed in ParMetis 3.1.

Name: tpwgts ignored for small graphs
Version: ParMETIS 3.0
Symptom: incorrect output (partitioning)
Description: When ParMETIS_V3_PartKway is used to partition into partitions
         of unequal sizes, the input array tpwgts is ignored and
         uniform-sized partitions are computed.  This bug shows up when
         (a) the number of vertices is < 10000 and (b) only one weight per
         vertex is given (ncon=1).
Reported: Reported to George Karypis and metis@cs.umn.edu on 2002/10/30.
Status: Fixed in ParMetis 3.1.

Name: AdaptiveRepart crashes on partless test
Version: ParMETIS 3.0
Symptom: run-time segmentation violation
Description: ParMETIS_V3_AdaptiveRepart crashes with a SIGSEGV if the input
         array _part_ contains any value greater than the desired number of
         partitions, nparts.  This shows up in Zoltan's "partless" test
         cases.
Reported: Reported to George Karypis and metis@cs.umn.edu on 2002/12/02.
Status: Fixed in ParMetis 3.1.

Name: load imbalance tolerance
Version: ParMETIS 2.0
Symptom: missing feature
Description: The load-imbalance parameter UNBALANCE_FRACTION can be set
         only at compile time.  With Zoltan, it is necessary to set this
         parameter at run time.
Reported: Reported to metis@cs.umn.edu on 19 Aug 1999.
Status: Fixed in version 3.0.

Name: no edges
Version: ParMETIS 2.0
Symptom: segmentation fault at run time
Description: ParMETIS crashes if the input graph has no edges and
         ParMETIS_PartKway is called.  We suspect all the graph-based
         methods crash.  From the documentation it is unclear whether a
         NULL pointer is valid input for the adjncy array; the bug occurs
         both with NULL as input and with a valid pointer to an array.
Reported: Reported to metis@cs.umn.edu on 5 Oct 1999.
Status: Fixed in version 3.0.

Name: no vertices
Version: ParMETIS 2.0, 3.0, 3.1
Symptom: segmentation fault at run time
Description: ParMETIS may crash if a processor owns no vertices.  The
         extent of this bug (which methods are affected) is not known.
         Again, it is unclear whether NULL pointers are valid input.
Reported: Reported to metis@cs.umn.edu on 6 Oct 1999.
Status: Fixed in 3.0 and 3.1 for the graph methods, but not the geometric
        methods.  New bug report sent on 2003/08/20.

Name: partgeom bug
Version: ParMETIS 2.0
Symptom: floating point exception
Description: For domains where the global delta_x, delta_y, or delta_z
         (in 3D) is zero (e.g., all nodes lie along the y-axis), a
         floating-point exception can occur when the partgeom algorithm
         is used.
Reported: kirk@cs.umn.edu in Jan 2001.
Status: Fixed in version 3.0.
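Although this failure was fixed in ParMETIS 3.0, an application still
running ParMETIS 2.0 can screen for such degenerate geometries before
requesting a geometric method.  The sketch below is a hypothetical
application-side check, not part of Zoltan or ParMETIS; the interleaved
coordinate layout (x0,y0,z0, x1,y1,z1, ...) is an assumption of the
example.

    #include <mpi.h>

    /* Return 1 if the global bounding box of the given points has zero
     * extent in some dimension (e.g., all nodes lie along the y-axis),
     * the case that triggered the partgeom FPE described above. */
    int bbox_is_degenerate(const double *coords, int npts, int ndim,
                           MPI_Comm comm)
    {
      double lmin[3] = { 1e300,  1e300,  1e300};
      double lmax[3] = {-1e300, -1e300, -1e300};
      double gmin[3], gmax[3];
      int i, d;

      for (i = 0; i < npts; i++)          /* local bounding box */
        for (d = 0; d < ndim; d++) {
          double c = coords[i*ndim + d];
          if (c < lmin[d]) lmin[d] = c;
          if (c > lmax[d]) lmax[d] = c;
        }

      MPI_Allreduce(lmin, gmin, ndim, MPI_DOUBLE, MPI_MIN, comm);
      MPI_Allreduce(lmax, gmax, ndim, MPI_DOUBLE, MPI_MAX, comm);

      for (d = 0; d < ndim; d++)          /* check global extents */
        if (gmax[d] - gmin[d] == 0.0)
          return 1;
      return 0;
    }

If the check reports a degenerate extent, the application can fall back to
a graph-based method or perturb the coordinates slightly.
-------------------------------------------------------------------------------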