Unexpected error in some models of AMD and Intel i7


#1

We are having problems on some AMD processors (and on at least one Intel i7 model) after successfully compiling the MMG library: when we try to run some 3D examples with MMG3D, we get an “Unexpected error”. We tried GDB, but the call stack is empty, so we have no clue why this is happening.

I understand that it is hard to guess the origin of the problem, especially if you cannot reproduce it. The only information we can give you is the computer model:

In all cases we are using the latest version of MMG (5.2.1), compiled with the latest version of GCC (6.2).

The only thing I can think of is that it is related to some instruction that these processors do not support, but it is hard to guess.

Thank you very much for your understanding.


#2

Hello,

I have not yet tried to reproduce your issue, but the “Unexpected error” message comes from our signal handler. You may get a more explicit message if you disable the signal handler (see the attached patch).
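To illustrate why a signal handler can hide the real cause, here is a minimal sketch of a handler that turns a fatal signal into a short generic message. It is not Mmg's actual handler: the names `describe_signal`, `report_and_exit` and `install_handlers` are hypothetical, and the message text only imitates the one seen in this thread. It assumes a POSIX system.

```c
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: map a signal number to a short description. */
static const char *describe_signal(int sig) {
    switch (sig) {
    case SIGILL:  return "Illegal instruction";
    case SIGSEGV: return "Segmentation fault";
    case SIGFPE:  return "Floating-point exception";
    case SIGABRT: return "Abnormal stop";
    default:      return "Unknown signal";
    }
}

/* Write a string to stderr using only async-signal-safe calls. */
static void emit(const char *s) {
    ssize_t n = write(STDERR_FILENO, s, strlen(s));
    (void)n; /* nothing useful can be done if the write fails */
}

/* Hypothetical handler: print a generic message and stop the process.
 * Because the process exits here, no core dump is produced and the
 * faulting location is only visible if a debugger stops at the signal first. */
static void report_and_exit(int sig) {
    emit("\n Unexpected error:  *** ");
    emit(describe_signal(sig));
    emit("\n");
    _exit(EXIT_FAILURE);
}

/* enable != 0: install the handler; enable == 0: restore the default
 * behaviour so that the system reports the crash itself. */
static void install_handlers(int enable) {
    void (*handler)(int) = enable ? report_and_exit : SIG_DFL;
    signal(SIGILL, handler);
    signal(SIGSEGV, handler);
    signal(SIGFPE, handler);
    signal(SIGABRT, handler);
}
```

Calling `install_handlers(0)` in this sketch plays the role of the patch: with the default disposition restored, the shell or the debugger reports the signal directly.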

Best Regards,

Algiane


#3

I am sorry for the delay. As I said, the error basically occurs on my personal computer, and during the last week I have not had time to check anything again. I compiled everything in debug mode (not just our code but MMG too) and the problem is still there. With GDB I identified the following lines, which look correct:

Thread 1 "python3" received signal SIGILL, Illegal instruction.
0x00007fffdac5f2b9 in _MMG5_scaleMesh (mesh=mesh@entry=0x17514c0, met=met@entry=0x1751fd0)
at /home/vicente/bin/mmg/src/common/scalem.c:196
196 for (k=1; k<mesh->np+1; k++) {
(gdb) bt
#0 0x00007fffdac5f2b9 in _MMG5_scaleMesh (mesh=mesh@entry=0x17514c0, met=met@entry=0x1751fd0)
at /home/vicente/bin/mmg/src/common/scalem.c:196
#1 0x00007fffdac5c687 in MMG3D_mmg3dlib (mesh=0x17514c0, met=0x1751fd0)
at /home/vicente/bin/mmg/src/mmg3d/libmmg3d.c:505

These lines correspond to:

scalem.c:

 for (k=1; k<mesh->np+1; k++) {
  for ( i=0; i<met->size; ++i ) {
     met->m[6*k+i] *= d1;
  }
}

libmmg3d.c:

if ( !_MMG5_scaleMesh(mesh,met) ) _LIBMMG5_RETURN(mesh,met,MMG5_STRONGFAILURE);

Using Valgrind I basically get:

==30562== Warning: set address range perms: large range [0x395db040, 0x4a883870) (defined)
==30562== Warning: set address range perms: large range [0x395db028, 0x4a883888) (noaccess)
==30562== Warning: set address range perms: large range [0x395db040, 0x4a883870) (defined)
==30562== Warning: set address range perms: large range [0x395db028, 0x4a883888) (noaccess)
==30562== Warning: set address range perms: large range [0x395db040, 0x4a883870) (defined)
==30562== Warning: set address range perms: large range [0x395db028, 0x4a883888) (noaccess)

#4

Hello,

Thank you for the details.

  • I don’t know whether the Valgrind warnings are related to your issue:

Warning: set address range perms: large range [0x395db040, 0x4a883870) (defined)

means that we allocate (and initialize) a large amount of memory, and

Warning: set address range perms: large range [0x395db028, 0x4a883888) (noaccess)

means that we deallocate a large amount of memory.

Maybe it is a bit strange that the reported addresses are the same for each operation, but I am not a Valgrind expert…

  • The SIGILL may be caused by memory corruption:
    • What is the size of your mesh (np)?
    • Can you reproduce this error using the executable on your mesh?
    • Does this error occur on any mesh (for example on a small one: m.mesh (116.6 KB) )?
    • It seems that you are using an anisotropic metric, am I right? Did you use the MMG3D_Set_solSize function to provide your sol size to Mmg?

Thank you in advance,

Regards,

Algiane


#5
  • The size of our mesh is less than 100 nodes; it is a unit test that we run every night, and it should finish in less than 10 seconds.
  • I didn’t try with the executable; I will try tonight.
  • It happens in all the 3D cases I have run so far.
  • Yes, I used MMG3D_Set_solSize; this function does not cause any problem on other processors.

Thank you very much


#6

Thank you,

  • Maybe Mmg adds compiler options that are not compatible with your processor: can you build your project with a high level of verbosity:

VERBOSE=1 make

and send me the compiler flags used?

  • Can you use gdb to examine what happens in scalem.c (maybe mesh->np, k or d1 is corrupted or NaN)?
  • Last:
    • Are you sure that you are linking with an Mmg library built on the same CPU as the one you run on?
    • Are you linking the static or the dynamic Mmg library?
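If stepping through with gdb is inconvenient, the same sanity checks can be sketched in code. The function below is a hypothetical stand-in for the anisotropic scaling loop, not Mmg's implementation: `scale_metric_checked` and its parameters are assumptions, with the metric stored as 6 values per point and points numbered from 1 to np as in the scalem.c excerpt above.

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical stand-in for the scaling loop: m holds 6 values per point,
 * points are numbered 1..np (the entries of point 0 are unused).
 * Returns 1 on success, 0 if an input looks corrupted; in that case
 * m is left untouched. */
static int scale_metric_checked(double *m, int np, double d1) {
    if (m == NULL || np < 0 || !isfinite(d1)) {
        fprintf(stderr, "suspicious input: np=%d d1=%e\n", np, d1);
        return 0;
    }
    /* First pass: only validate, so that a failure modifies nothing. */
    for (int k = 1; k < np + 1; k++) {
        for (int i = 0; i < 6; ++i) {
            if (!isfinite(m[6 * k + i])) {
                fprintf(stderr, "non-finite metric at point %d, entry %d\n", k, i);
                return 0;
            }
        }
    }
    /* Second pass: apply the scaling factor. */
    for (int k = 1; k < np + 1; k++) {
        for (int i = 0; i < 6; ++i) {
            m[6 * k + i] *= d1;
        }
    }
    return 1;
}
```

If every check passes and the crash still occurs at the loop, data corruption becomes a less likely explanation than the generated machine code itself.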

Regards,

Algiane


#7

Right now I can’t do it because it only happens on my personal computer, but at least I can tell you that I use the static libraries. I am sure that I am using a library compiled on the same machine (100% sure).


#8

There is no problem with the binary. I just checked.


#9

The problem is apparently in k. I modified scalem.c:

    } else { //met->size==6
      d1 = 1.0 / (dd*dd);
      fprintf(stderr,"  ## ERROR: VALUE mesh->np+1 %d -- \n",mesh->np+1);
      fprintf(stderr,"  ## ERROR: VALUE d1 %d -- \n",d1);
      for (k=1; k<mesh->np+1; k++) {
          fprintf(stderr,"  ## ERROR: VALUE k %d -- \n",k);
        for ( i=0; i<met->size; ++i ) {
          met->m[6*k+i] *= d1;
        }
      }

But I get this:

No inverted elements found
No inverted conditions found
  %% mmg_test/3D_hessian_test_step=0.mesh OPENED
  %% mmg_test/3D_hessian_test_step=0.sol OPENED
  ## ERROR: VALUE mesh->np+1 794 -- 
  ## ERROR: VALUE d1 1821603581 -- 

 Unexpected error:  *** Illegal instruction

So the problem occurs when the loop starts, with k.

Additionally, here is my verbose build output: https://dl.dropboxusercontent.com/u/11225823/compiling.out


#10

Hello,

In your build output, we can see that you are compiling mmg with the following flags: **-mavx2 -fPIC -fopenmp**.
Can you please try to build mmg without any flags (the -mavx2 flag may not be supported by your processor…)?
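One way to check this on the affected machine is to ask the CPU at run time. The sketch below assumes GCC (or a compatible compiler) on x86 and uses the `__builtin_cpu_init` and `__builtin_cpu_supports` builtins; the helper names are hypothetical, and on other platforms the helpers simply report "unknown".

```c
#include <stdio.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_CPU_BUILTINS 1
#else
#define HAVE_CPU_BUILTINS 0
#endif

/* Each helper returns 1 if the instruction set is supported by the
 * CPU running the program, 0 if it is not, and -1 if unknown. */
static int cpu_supports_sse3(void) {
#if HAVE_CPU_BUILTINS
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse3") ? 1 : 0;
#else
    return -1;
#endif
}

static int cpu_supports_avx(void) {
#if HAVE_CPU_BUILTINS
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return -1;
#endif
}

static int cpu_supports_avx2(void) {
#if HAVE_CPU_BUILTINS
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return -1;
#endif
}

/* Print a short report (1 = supported, 0 = not supported, -1 = unknown). */
static void print_cpu_report(void) {
    printf("sse3: %d\navx:  %d\navx2: %d\n",
           cpu_supports_sse3(), cpu_supports_avx(), cpu_supports_avx2());
}
```

If `cpu_supports_avx2()` returns 0 on a machine where the library was built with -mavx2, the compiler may have emitted instructions that this CPU cannot execute, which is consistent with a SIGILL.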

One additional remark, although I don’t think it changes anything: d1 is a double value, so you must print it using %e.
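As a small illustration of the format specifiers, the sketch below prints an int with %d and a double with %e. The function name `format_scale` and the sample values are hypothetical; passing a double where %d is expected is undefined behaviour in C, which is why a debug print of that kind can show a meaningless integer.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format an int counter and a double scaling factor.
 * %d matches int, %e matches double (scientific notation). */
static int format_scale(char *buf, size_t len, int np, double d1) {
    return snprintf(buf, len, "np+1 = %d, d1 = %e", np + 1, d1);
}
```

For example, `format_scale(buf, sizeof buf, 793, 0.25)` writes `np+1 = 794, d1 = 2.500000e-01` into `buf`.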

Regards,

Algiane


#11

OK, I will try at home.

PS: Sorry, I used to know C, but now I am used to C++ and I don’t remember how to use fprintf correctly.


#12

Solved: it works using -mavx instead of -mavx2. A list of the vector instructions is available here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ . One option that is sure to be compatible with most machines is -msse3.

Thank you very much, and sorry for all these problems.
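To confirm which instruction set a given build was allowed to use, one can inspect the macros that GCC predefines for flags such as -mavx2, -mavx and -msse3. This is a minimal sketch; the function name `compiled_simd_level` is hypothetical, and it reports only the three levels discussed in this thread.

```c
#include <stdio.h>

/* Report the highest SIMD level enabled at compile time, based on the
 * macros the compiler predefines for the corresponding -m flags. */
static const char *compiled_simd_level(void) {
#if defined(__AVX2__)
    return "avx2";
#elif defined(__AVX__)
    return "avx";
#elif defined(__SSE3__)
    return "sse3";
#else
    return "baseline";
#endif
}
```

Printing `compiled_simd_level()` from a file compiled with the same flags as the library shows what the compiler was permitted to emit; the result depends on the flags, so it is a property of the build and not of the machine it runs on.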


#13

Hello,

I am happy that we have found the origin of your error!

Regards,

Algiane


#14