Problems existing in Zoltan.
This file was last updated on $Date$

-------------------------------------------------------------------------------
ERROR CONDITIONS IN ZOLTAN
When a processor returns from Zoltan to the application due to an error
condition, other processors do not necessarily return the same condition.
In fact, other processors may not know that the processor has quit Zoltan,
and may hang in a communication (waiting for a message that is not sent
due to the error condition). The parallel error-handling capabilities of
Zoltan will be improved in future releases.
-------------------------------------------------------------------------------
RCB/RIB ON ASCI RED
On ASCI Red, the number of context IDs (e.g., MPI Communicators) is limited
to 8192. The environment variable MPI_REUSE_CONTEXT_IDS must be set to
reuse the IDs; setting this variable, however, slows performance.
An alternative is to set Zoltan_Parameter TFLOPS_SPECIAL to "1". With
TFLOPS_SPECIAL set, communicators in RCB/RIB are not split and, thus, the
application is less likely to run out of context IDs. However, ASCI Red
also has a bug that is exposed by TFLOPS_SPECIAL; when messages that use
MPI_Send/MPI_Recv within RCB/RIB exceed the MPI_SHORT_MSG_SIZE, MPI_Recv
hangs. We do not expect these conditions to exist on future platforms and,
indeed, plan to make TFLOPS_SPECIAL obsolete in future versions of Zoltan
rather than re-work it with MPI_Irecv. -- KDD 10/5/2004
-------------------------------------------------------------------------------
ERROR CONDITIONS IN OCTREE, PARMETIS AND JOSTLE
On failure, OCTREE, ParMETIS and Jostle methods abort rather than return
error codes.
-------------------------------------------------------------------------------
ZOLTAN_INITIALIZE BUT NO ZOLTAN_FINALIZE
If Zoltan_Initialize calls MPI_Init, then MPI_Finalize
will never be called because there is no Zoltan_Finalize routine.
If the application uses MPI and calls MPI_Init and MPI_Finalize,
then there is no problem.
-------------------------------------------------------------------------------
HETEROGENEOUS ENVIRONMENTS
Some parts of Zoltan currently assume that basic data types like
integers and real numbers (floats) have identical representation
on all processors. This may not be true in a heterogeneous
environment. Specifically, the unstructured (irregular) communication
library is unsafe in a heterogeneous environment. This problem
will be corrected in a future release of Zoltan for heterogeneous
systems.
-------------------------------------------------------------------------------
F90 ISSUES
Pacific Sierra Research (PSR) Vastf90 is not currently supported due to bugs
in the compiler with no known workarounds. It is not known when or if this
compiler will be supported.

N.A.Software FortranPlus is not currently supported due to problems with the
query functions. We anticipate that this problem can be overcome, and support
will be added soon.
-------------------------------------------------------------------------------
PROBLEMS EXISTING IN PARMETIS
(Reported to the ParMETIS development team at the University of Minnesota,
metis@cs.umn.edu)

Name: Free-memory write in PartGeomKway
Version: ParMETIS 3.1.1
Symptom: Free-memory write reported by Purify and Valgrind for graphs with
  no edges.
Description:
  For input graphs with no (or, perhaps, few) edges, Purify and Valgrind
  report writes to already freed memory as shown below.
  FMW: Free memory write:
    * This is occurring while in thread 22199:
        SetUp(void) [setup.c:80]
        PartitionSmallGraph(void) [weird.c:39]
        ParMETIS_V3_PartGeomKway [gkmetis.c:214]
        Zoltan_ParMetis [parmetis_interface.c:280]
        Zoltan_LB [lb_balance.c:384]
        Zoltan_LB_Partition [lb_balance.c:91]
        run_zoltan [dr_loadbal.c:581]
        main [dr_main.c:386]
        __libc_start_main [libc.so.6]
        _start [crt1.o]
    * Writing 4 bytes to 0xfcd298 in the heap.
    * Address 0xfcd298 is at the beginning of a freed block of 4 bytes.
    * This block was allocated from thread -1781075296:
        malloc [rtlib.o]
        GKmalloc(void) [util.c:151]
        idxmalloc(void) [util.c:100]
        AllocateWSpace [memory.c:28]
        ParMETIS_V3_PartGeomKway [gkmetis.c:123]
        Zoltan_ParMetis [parmetis_interface.c:280]
        Zoltan_LB [lb_balance.c:384]
        Zoltan_LB_Partition [lb_balance.c:91]
        run_zoltan [dr_loadbal.c:581]
        main [dr_main.c:386]
        __libc_start_main [libc.so.6]
        _start [crt1.o]
    * There have been 10 frees since this block was freed from thread 22199:
        GKfree(void) [util.c:168]
        Mc_MoveGraph(void) [move.c:92]
        ParMETIS_V3_PartGeomKway [gkmetis.c:149]
        Zoltan_ParMetis [parmetis_interface.c:280]
        Zoltan_LB [lb_balance.c:384]
        Zoltan_LB_Partition [lb_balance.c:91]
        run_zoltan [dr_loadbal.c:581]
        main [dr_main.c:386]
        __libc_start_main [libc.so.6]
        _start [crt1.o]
Reported: 8/31/09, http://glaros.dtc.umn.edu/flyspray/task/50
Status: Reported 8/31/09

Name: PartGeom limitation
Version: ParMETIS 3.0, 3.1
Symptom: inaccurate number of partitions when # partitions != # processors
Description:
  ParMETIS method PartGeom only produces decompositions in which the
  number of partitions equals the number of processors. Zoltan parameters
  NUM_GLOBAL_PARTITIONS and NUM_LOCAL_PARTITIONS will be ignored.
Reported: Not yet reported.
Status: Not yet reported.

Name: vsize array freed in ParMetis
Version: ParMETIS 3.0 and 3.1
Symptom: seg. fault, core dump at runtime
Description:
  When calling ParMETIS_V3_AdaptiveRepart with the vsize parameter,
  ParMetis will try to free the vsize array even if it was
  allocated in Zoltan. Zoltan will then try to free vsize again
  later, resulting in a fatal error. As a temporary fix,
  Zoltan will never call ParMetis with the vsize parameter.
Reported: 11/25/2003.
Status: Acknowledged by George Karypis.

Name: ParMETIS_V3_AdaptiveRepart and ParMETIS_V3_PartKWay crash
  for zero-sized partitions.
Version: ParMETIS 3.1
Symptom: run-time error "killed by signal 8" on DEC. FPE, divide-by-zero.
Description:
  Metis divides by partition size; thus, zero-sized partitions
  cause a floating-point exception.
Reported: 9/9/2003.
Status: ?
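
Until the divide-by-zero is fixed, a caller can check the requested
partition sizes before invoking ParMETIS. The following is a minimal
sketch of such a check and is not code from Zoltan or ParMETIS; the
function name first_empty_partition and the arguments part_sizes and
nparts are hypothetical, and it assumes one float size per partition.

```c
#include <stddef.h>

/* Returns the index of the first partition whose requested size is
 * zero or negative, or -1 if every partition has a positive size.
 * part_sizes and nparts are hypothetical names for this sketch. */
int first_empty_partition(const float *part_sizes, int nparts)
{
    if (part_sizes == NULL)
        return -1;  /* no sizes were specified, so nothing to check */
    for (int i = 0; i < nparts; i++) {
        if (part_sizes[i] <= 0.0f)
            return i;
    }
    return -1;
}
```

A non-negative return value identifies a partition that would trigger
the division by zero, so the caller can report an error or choose
another method instead of calling ParMETIS.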

Name: ParMETIS_V3_AdaptiveRepart dies for zero-sized partitions.
Version: ParMETIS 3.0
Symptom: run-time error "killed by signal 8" on DEC. FPE, divide-by-zero.
Description:
  ParMETIS_V3_AdaptiveRepart divides by partition size; thus, zero-sized
  partitions cause a floating-point exception. This problem is exhibited in
  adaptive-partlocal3 tests. The tests actually run on Sun and Linux machines
  (which don't seem to care about the divide-by-zero), but cause an FPE
  signal on DEC (Compaq) machines.
Reported: 1/23/2003.
Status: Fixed in ParMetis 3.1, but new problem appeared (see above).

Name: ParMETIS_V3_AdaptiveRepart crashes when no edges.
Version: ParMETIS 3.0
Symptom: Floating point exception, divide-by-zero.
Description:
  Divide-by-zero in ParMETISLib/adrivers.c, function Adaptive_Partition,
  line 40.
Reported: 1/23/2003.
Status: Fixed in ParMetis 3.1.

Name: Uninitialized memory read in akwayfm.c.
Version: ParMETIS 3.0
Symptom: UMR warning.
Description:
  UMR in ParMETISLib/akwayfm.c, function Moc_KWayAdaptiveRefine, near line 520.
Reported: 1/23/2003.
Status: Fixed in ParMetis 3.1.

Name: Memory leak in wave.c
Version: ParMETIS 3.0
Symptom: Some memory not freed.
Description:
  Memory leak in ParMETISLib/wave.c, function WavefrontDiffusion;
  memory for the following variables is not always freed:
  solution, perm, workspace, cand
  We believe the early return near line 111 causes the problem.
Reported: 1/23/2003.
Status: Fixed in ParMetis 3.1.

Name: tpwgts ignored for small graphs.
Version: ParMETIS 3.0
Symptom: incorrect output (partitioning)
Description:
  When using ParMETIS_V3_PartKway to partition into partitions
  of unequal sizes, the input array tpwgts is ignored and
  uniform-sized partitions are computed. This bug shows up when
  (a) the number of vertices is < 10000 and (b) only one weight
  per vertex is given (ncon=1).
Reported: Reported to George Karypis and metis@cs.umn.edu on 2002/10/30.
Status: Fixed in ParMetis 3.1.

Name: AdaptiveRepart crashes on partless test.
Version: ParMETIS 3.0
Symptom: run-time segmentation violation.
Description:
  ParMETIS_V3_AdaptiveRepart crashes with a SIGSEGV if
  the input array _part_ contains any value greater than
  the desired number of partitions, nparts. This shows up
  in Zoltan's "partless" test cases.
Reported: Reported to George Karypis and metis@cs.umn.edu on 2002/12/02.
Status: Fixed in ParMetis 3.1.
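
With an affected ParMETIS version, the crash can be avoided by
validating the part array before the call. The sketch below is
illustrative only and is not taken from Zoltan; the function name
first_invalid_part is hypothetical, plain int stands in for the
ParMETIS index type, and 0-based partition numbering is assumed, so any
value outside [0, nparts) is rejected (slightly stricter than the
condition described above).

```c
#include <stddef.h>

/* Returns the index of the first entry of part[] that lies outside
 * [0, nparts), or -1 if all nvtxs entries are valid.
 * Assumption: partitions are numbered from 0 (C-style numbering). */
long first_invalid_part(const int *part, size_t nvtxs, int nparts)
{
    for (size_t i = 0; i < nvtxs; i++) {
        if (part[i] < 0 || part[i] >= nparts)
            return (long)i;
    }
    return -1;
}
```

If the function returns a non-negative index, the caller can remap or
reset the offending entries before passing part to
ParMETIS_V3_AdaptiveRepart.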

Name: load imbalance tolerance
Version: ParMETIS 2.0
Symptom: missing feature
Description:
  The load imbalance parameter UNBALANCE_FRACTION can
  only be set at compile-time. With Zoltan it is
  necessary to be able to set this parameter at run-time.
Reported: Reported to metis@cs.umn.edu on 19 Aug 1999.
Status: Fixed in version 3.0.

Name: no edges
Version: ParMETIS 2.0
Symptom: segmentation fault at run time
Description:
  ParMETIS crashes if the input graph has no edges and
  ParMETIS_PartKway is called. We suspect all the graph-based
  methods crash. From the documentation it is unclear whether
  a NULL pointer is a valid input for the adjncy array.
  Apparently, the bug occurs with either NULL or a valid
  pointer to an array as input.
Reported: Reported to metis@cs.umn.edu on 5 Oct 1999.
Status: Fixed in version 3.0.

Name: no vertices
Version: ParMETIS 2.0, 3.0, 3.1
Symptom: segmentation fault at run time
Description:
  ParMETIS may crash if a processor owns no vertices.
  It is not known which methods are affected.
  Again, it is unclear whether NULL pointers are valid input.
Reported: Reported to metis@cs.umn.edu on 6 Oct 1999.
Status: Fixed in 3.0 and 3.1 for the graph methods, but not the geometric methods.
  New bug report sent on 2003/08/20.

Name: partgeom bug
Version: ParMETIS 2.0
Symptom: floating point exception
Description:
  For domains where the global delta_x, delta_y, or delta_z (in 3D)
  is zero (e.g., all nodes lie along the y-axis), a floating point
  exception can occur when the partgeom algorithm is used.
Reported: kirk@cs.umn.edu in Jan 2001.
Status: Fixed in version 3.0.
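
For applications still linked against ParMETIS 2.0, the degenerate case
can be detected before partgeom is called. The following sketch is an
assumption-laden illustration, not Zoltan code: the function name
has_zero_extent is hypothetical, coordinates are assumed to be stored
interleaved as doubles, and only the points held by the calling
processor are examined.

```c
#include <stddef.h>

/* Returns 1 if the points have zero extent along at least one of the
 * ndims coordinate directions (or if there are no points), else 0.
 * Assumption: coords[i*ndims + d] holds coordinate d of point i. */
int has_zero_extent(const double *coords, size_t npoints, int ndims)
{
    if (npoints == 0)
        return 1;
    for (int d = 0; d < ndims; d++) {
        double lo = coords[d], hi = coords[d];
        for (size_t i = 1; i < npoints; i++) {
            double x = coords[i * (size_t)ndims + (size_t)d];
            if (x < lo) lo = x;
            if (x > hi) hi = x;
        }
        if (hi == lo)  /* every point shares this coordinate */
            return 1;
    }
    return 0;
}
```

Because the problem concerns the global extent, a parallel code would
first combine the per-processor minima and maxima (for example with
MPI_Allreduce using MPI_MIN and MPI_MAX) and apply the same comparison
to the reduced values.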

-------------------------------------------------------------------------------