Discussion:
Solaris 10 06/06 x86 HP DL585 boot hang after reboot HELP!!!
Dmitry
2006-07-15 11:09:33 UTC
Solaris 10 06/06 x86 on an HP DL585 hangs at boot after a reboot.

When I shut down the server and power it on,
Solaris sometimes starts and works fine,
and sometimes hangs.

But after "reboot" (or "init 6"), the boot usually hangs at
"SunOS Release 5.10 Version....
Copyright ...
Use is subject to license terms"



HELP!!!


This message posted from opensolaris.org
Doug Scott
2006-07-15 12:02:09 UTC
Post by Dmitry
Solaris 10 06/06 x86 HP DL585 boot hang after reboot
when I shutdown server and power on,
solaris sometimes start, and working good,
sometime hang
but after "reboot" (or "init 6"), usually boot hang on
"SunOS Release 5.10 Version....
Copyright ...
Use is subject to license terms"
HELP!!!
Very Strange...

To see which kernel module is being loaded at the time, you can either add
set moddebug=0x80000000
to /etc/system, or do a 'reboot -- -kd'. The latter should reboot and drop you
straight into the debugger. From there you can do a
moddebug/W0x80000000
:c
The ':c' at this point continues the system boot.

Hopefully this will show you the modules as they load, and you may see which one
is giving you a problem.
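
A minimal sketch of the /etc/system route described above, for anyone who wants to script it. The helper name add_tunable is made up for illustration and is not a Solaris utility; it appends a line only if it is not already present and keeps a one-time backup copy first. It assumes a grep that understands -F, -q and -x (on Solaris 10 that is probably /usr/xpg4/bin/grep rather than /usr/bin/grep).

```shell
# add_tunable FILE LINE   (hypothetical helper, not part of Solaris)
# Append LINE to the existing FILE unless an identical line is already there.
# A backup copy FILE.orig is made the first time the file is changed.
add_tunable() {
    file=$1
    line=$2
    if grep -Fqx -- "$line" "$file"; then
        return 0                      # already present, nothing to do
    fi
    [ -f "$file.orig" ] || cp -p "$file" "$file.orig"
    printf '%s\n' "$line" >> "$file"
}

# Example, to be run as root on the affected machine:
#   add_tunable /etc/system 'set moddebug=0x80000000'
```

Remove the line again (or restore the .orig copy) once the debugging is done, since the setting stays active on every later boot.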

Doug


Dmitry
2006-07-15 15:04:35 UTC
Thanks,
I set moddebug=0x80000000 in /etc/system
and rebooted a few times,

and ...
it hangs at
installing acpica, module id 11

I am a novice at Solaris.

Before this, while I was testing Oracle, the system also hung during operation!!!
Is that related to the boot hang?

Is it possible to resolve this?

The DL585 BIOS doesn't have any settings for ACPI.
DL585, 4 x dual-core Opteron 2.4 GHz, 32 GB RAM
latest BIOS firmware
The only setting is the OS type (Linux, Windows 2003, Others); I tried all of them, same hang.

I'm shocked.


Dana H. Myers
2006-07-15 15:41:45 UTC
Post by Dmitry
Thanks,
set moddebug=0x80000000 in /etc/system
a few reboots
and ...
hang on
[b]installing acpica, module id 11[/b]
i am novice at Solaris.
and brfore it, i do test Oracle, and system hang in work!!!
it's correlated with boot hang?
it's possible resolve it?
in DL585 BIOS does't has any settings for ACPI
DL585 4 x dual core Opteron 2.4G 32G RAM
latest firmware BIOS
only settings OS (Linux, Windows 2003, Others), i try always, the same hang
What module is loaded just before acpica ?

To get some idea if this hang is related to ACPI, you could
try booting with the boot option "acpi-user-options=0x8"; you can
do this by entering the GRUB menu during boot and adding

-B acpi-user-options=0x8

to the end of the 'kernel' line.
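
For illustration only, this is what the edited line might look like. The sketch assumes the 'kernel /platform/i86pc/multiboot' line mentioned elsewhere in this thread, with no -B option on it yet. with_acpi_option is a made-up helper; a line that already carries -B properties would need the new property merged into that existing list instead, which the sketch does not handle.

```shell
# with_acpi_option KERNEL_LINE VALUE   (hypothetical helper)
# Print KERNEL_LINE with "-B acpi-user-options=VALUE" appended, i.e. what
# the GRUB 'kernel' line should look like after the edit.
with_acpi_option() {
    printf '%s -B acpi-user-options=%s\n' "$1" "$2"
}

with_acpi_option 'kernel /platform/i86pc/multiboot' 0x8
```

The change made in the GRUB edit screen applies to that one boot only, so a typo there does no lasting harm.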

However, I suspect this may be another instance of CR 6401605, which
is unrelated to ACPI. See:

http://blogs.sun.com/roller/page/anish?entry=x64_solaris_installation_could_fail

for a way to test this.

Dana
Jean-François Ndi
2006-07-15 15:55:42 UTC
Hi,

you can try this:

- boot in the debugger (-kd)
- ::bp acpica`_init (note the backtick character)
- :c

When you hit the break point:
- ::step over

Repeat this until the system either hangs or you see another module being installed. (I know, it's not the most elegant approach, but it will help you determine what is going on.) If another module gets loaded, retry the experiment (here I assume it will not hang).

If a hang occurs, just note the last instruction executed.

If you don't like to play in the debugger, use the procedure described by Doug but add the v option (-kvd). That may provide more information.
Post by Dmitry
i try to boot with,
-B acpi-user-options=0x2
the same hang
AFAIK, this property is used later in the code, so I don't think it will be useful in this case.

Hope that can help.

Regards,

J-F


Dana H. Myers
2006-07-15 16:14:31 UTC
Post by Jean-François Ndi
Hi,
- boot in the debugger (-kd)
- ::bp acpica`_init (note the backtick character)
- :c
- ::step over
this, until you either hang or see another module getting installed. (I know, not the most elegant but it will help you determine what is going on.). If another module gets loaded retry the experiment (here I assume it will not hang).
acpica_init is called several times, but only the first call will cause an
initialization; either this succeeds or fails, but subsequent calls are
effectively ignored.

Note that acpica is loaded very early in boot; pci_autoconfig uses it,
so a hang during PCI initialization may appear to be related to loading
acpica.
Post by Jean-François Ndi
If a hang occurs, just note the last instruction executed.
If you don't like to play in the debugger, use the procedure described by Doug but add the v option (-kvd). More information could be provided.
i try to boot with,
-B acpi-user-options=0x2
the same hang
I'd suggest trying acpi-user-options=0x8 first, then acpi-user-options=0x4,
and finally acpi-user-options=0x2. See:

http://blogs.sun.com/roller/page/danasblog?entry=configuring_solaris_acpi_at_boot

acpi-user-options=0x2 will prevent the ACPI CA subsystem from attempting
to initialize in acpica_init:

    /*
     * Make sure user options are processed,
     * then fail to initialize if ACPI CA has been
     * disabled
     */
    acpica_process_user_options();
    if (!acpica_enable)
        return (AE_ERROR);

(from http://cvs.opensolaris.org/source/xref/on/usr/src/uts/i86pc/io/acpica/acpica.c)


However, did I miss something? Did Dmitry already attempt booting with
acpi-user-options=0x2 and the hang still occurs?

Dana
Dmitry
2006-07-16 07:26:49 UTC
With -kd the hang does not occur.

I have attached a full screenshot taken with set moddebug=0x80000000.


Jean-François Ndi
2006-07-16 08:01:23 UTC
Hi,
Post by Dmitry
in -kd hang not present,
i attach full screenshot with set moddebug=0x80000000
second file in the first message
Can you add the v option to the kernel line (-kvd)? Maybe the system will show something after loading acpica but before hanging.

Any better idea?

Regards,

J-F


Tiffany Ongsiaco
2006-07-18 01:40:44 UTC
[b]The Solaris is being revamped because of Sun going through new construction with Scott.[/b]


Jürgen Keil
2006-07-16 08:46:27 UTC
Post by Dmitry
Solaris 10 06/06 x86 HP DL585 boot hang aftrer reboot
when I shutdown server and power on,
solaris sometimes start, and working good,
sometime hang
That is, if you power cycle the machine, it will boot without hanging,
at least sometimes?

One idea is that we have two or more devices sharing the interrupt vector
that acpi is using, and we somehow have a pending interrupt on one of the
other devices, when the acpi interrupt handler is installed.
Try to boot the machine; when the machine is running you can print the
interrupt vector table with the command:

echo ::interrupts | mdb -k
(I hope the ::interrupts command is supported by S10 6/2006)

acpi_wrapper_isr is the interrupt handler for acpi; is there another
device sharing the same vector?

The other idea is that there's a problem with the solaris acpi module
when used on that hp dl585 machine. Maybe the acpi bios table on
that hp machine sends some acpi events that Solaris doesn't understand?
That could also be the explanation that the machine sometimes hangs
when you test oracle.


Dmitry
2006-07-16 09:28:26 UTC
Post by Jürgen Keil
That is, if you power cycle the machine, it will boot
without hanging,
at least sometimes?
After a power off and on it usually boots (sometimes not).
Post by Jürgen Keil
echo ::interrupts | mdb -k
(I hope the ::interrupts command is supported by S10
6/2006)
# echo ::interrupts | mdb -k
mdb: invalid command '::interrupts': unknown dcmd name
Post by Jürgen Keil
The other idea is that there's a problem with the
solaris acpi module
when used on that hp dl585 machine. Maybe the acpi
bios table on
that hp machine sends some acpi events that Solaris
doesn't understand?
That could also be the explanation that the machine
sometimes hangs
when you test oracle.
How can I test that, or resolve it?

To jfndi:
with -kv, a few reboots did not hang;
without -kv, it hangs again.


Jean-François Ndi
2006-07-16 10:04:31 UTC
Post by Dmitry
[b]to jfndi[/b]
with -kv few reboot not hang,
without -kv, again hang
I just began to look at the ACPI code and spec. I have no further ideas for now. If I come up with another one I will let you know.

Regards,

J-F


Dmitry
2006-07-16 10:36:23 UTC
Post by Jean-François Ndi
I just began to look at the ACPI code and spec. I
have no furher idea for now. If I have another one I
will let you know.
Thanks, I'm waiting impatiently ...
I have to decide whether to use Solaris or not.


Jürgen Keil
2006-07-16 10:47:55 UTC
Post by Dmitry
Post by Jürgen Keil
echo ::interrupts | mdb -k
(I hope the ::interrupts command is supported by S10 6/2006)
[b]# echo ::interrupts | mdb -k
mdb: invalid command '::interrupts': unknown dcmd
name[/b]
Too bad, the mdb "::interrupts" command isn't backported to S10U2.
This is a new feature in OpenSolaris, though.

On an OpenSolaris build 45 box, it would print something like this:

# echo ::interrupts | mdb -k
IRQ  Vector  IPL(lo/hi)  Bus  Share  ISR(s)
0    0x20    14/14       -    1      cbe_fire
1    0x21    5/5         ISA  1      i8042_intr
3    0x23    12/12       ISA  1      asyintr
4    0x24    12/12       ISA  1      asyintr
5    0x25    1/9         PCI  5      ohci_intr, ata_intr, hci1394_isr, ohci_intr, audioens_intr
6    0x26    5/5         ISA  1      fdc_intr
9    0x29    9/9         -    1      acpi_wrapper_isr
10   0x2a    1/6         PCI  4      iprb_intr, ohci_intr, uhci_intr, uhci_intr
11   0x2b    1/1         PCI  1      ehci_intr
12   0x2c    5/5         ISA  1      i8042_intr
14   0x2e    5/5         PCI  1      ata_intr
15   0x2f    5/5         PCI  1      ata_intr
Post by Dmitry
Post by Jürgen Keil
The other idea is that there's a problem with the
solaris acpi module when used on that hp dl585 machine.
Maybe the acpi bios table on that hp machine
sends some acpi events that Solaris doesn't understand?
and it cause to hang?!!!
There have been issues in the past where the bios acpi code
was sending lots of (>10000) interrupts/second.

For example, see this thread:
http://www.opensolaris.org/jive/thread.jspa?messageID=46992
Post by Dmitry
Post by Jürgen Keil
That could also be the explanation that the machine
sometimes hangs when you test oracle.
how to test it, or resolve?
Is the hang with oracle reproducible? Does it always hang when you
run a certain database command? For example when you have
tablespace allocated to files on a logging ufs filesystem, and you
drop such a file (= remove a big file from a logging ufs filesystem) [*] ?

What kind of machine is that DL585, how many cpus does it have?
Single or dual core?


[*]
Bug 6302747
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6302747

Bug 6251659
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6251659


Dmitry
2006-07-16 11:28:45 UTC
Post by Jürgen Keil
Is the hang with oracle reproducable? Does is always
hang when you
run a certain database command? For example when you
have
tablespace allocated to files on a logging ufs
filesystem, and you
drop such a file (= remove a big file from a logging
ufs filesystem) [*] ?
I will now start Oracle and try creating the big tablespaces again (last time this was interrupted by the hang).
Post by Jürgen Keil
What kind of machine is that DL585, how many cpus
does it have?
Single or dual core?
DL585 4 x dual core Opteron 2.4G 32G RAM


Dana H. Myers
2006-07-16 16:49:31 UTC
Post by Jürgen Keil
There have been issues in the past where the bios acpi code
was sending lots of (>10000) interrupts/second.
http://www.opensolaris.org/jive/thread.jspa?messageID=46992
It is very unlikely that Dmitry is seeing this problem with an
HP DL585; the issue reported above only occurs on W1100z/W2100z
with certain BIOS versions and is the result of poorly-coded
BIOS code.

Dana
Jürgen Keil
2006-07-16 11:03:14 UTC
Post by Dmitry
[b]# echo ::interrupts | mdb -k
mdb: invalid command '::interrupts': unknown dcmd
name[/b]
Another way to print the interrupt vector table is to use
Dan Mick's autovec utility. See:

http://groups.yahoo.com/group/solarisx86/message/29846

(The makefile and the source probably need a few minor
tweaks to build it as a 64-bit executable on current S10 U2.)

Btw. the above yahoo groups thread is about an issue with
acpi and > 30000 interrupts/second. I think the issue is
fixed on current opensolaris now, but I'm unsure if the fix has
been backported to S10 U2.


Dana H. Myers
2006-07-16 16:53:54 UTC
Post by Jürgen Keil
Post by Dmitry
Solaris 10 06/06 x86 HP DL585 boot hang aftrer reboot
when I shutdown server and power on,
solaris sometimes start, and working good,
sometime hang
That is, if you power cycle the machine, it will boot without hanging,
at least sometimes?
One idea is that we have two or more devices sharing the interrupt vector
that acpi is using, and we somehow have a pending interrupt on one of the
other devices, when the acpi interrupt handler is installed.
Try to boot the machine; when the machine is running you can print the
interrupt vector table with the command:
echo ::interrupts | mdb -k
(I hope the ::interrupts command is supported by S10 6/2006)
acpi_wrapper_isr is the interrupt handler for acpi; is there another
device sharing the same vector?
The other idea is that there's a problem with the solaris acpi module
when used on that hp dl585 machine. Maybe the acpi bios table on
that hp machine sends some acpi events that Solaris doesn't understand?
That could also be the explanation that the machine sometimes hangs
when you test oracle.
These are good suggestions, but apparently Dmitry has booted with
acpi-user-options=0x8 and acpi-user-options=0x2 and the hang still
occurs (Dmitry, please correct me if I'm wrong). Since acpica is
loaded early, along with pci_autoconfig and the PSM modules, it is
apparently easy to fixate on acpica as a source of trouble when there's
plenty of other things going on in the system; I'd suggest investigating
whether this is another instance of CR 6401605, which is unrelated to ACPI. See:

http://blogs.sun.com/roller/page/anish?entry=x64_solaris_installation_could_fail

Dana
Dmitry
2006-07-16 12:39:41 UTC
I don't have the experience or a compiler to rebuild autovec.

Can I exclude the acpica module?
In /etc/system:
exclude: acpica

What would the consequences be for me?


Jürgen Keil
2006-07-16 14:39:38 UTC
Post by Dmitry
i havn't experience and compiler for remake autovec
A compiler should be available as /usr/sfw/bin/gcc, if you've installed
everything from the OS installation CD/DVD.

You can also use the 32-bit autovec executable that is included
in Dan Mick's source archive (so no recompile is needed)
ftp://playground.sun.com/pub/dmick/autovec.tar.Z

To use the 32-bit executable, you have to boot your opteron box in
32-bit mode, by editing the kernel boot command line in grub:
add the 32-bit kernel name "kernel/unix " at the end of the
"kernel /platform/i86pc/multiboot " line.
Post by Dmitry
can i exlude acpica module?
in /etc/system
exclude: acpica
what consequences for me?
Hmm, your attempt to disable Solaris acpi with the
"-B acpi-user-options=0x2" option should have worked.

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/i86pc/io/acpica/acpica.c#acpica_init

acpica_init()
{
    ACPI_STATUS status;

    /*
     * Make sure user options are processed,
     * then fail to initialize if ACPI CA has been
     * disabled
     */
    acpica_process_user_options();
    if (!acpica_enable)
        return (AE_ERROR);
    ....



Are you 100% sure there weren't any typos with the
"acpi-user-options" parameter when you tested that?


Dana H. Myers
2006-07-16 16:56:25 UTC
Post by Jürgen Keil
Post by Dmitry
i havn't experience and compiler for remake autovec
A compiler should be available as /usr/sfw/bin/gcc, if you've installed
everything from the OS installation CD/DVD.
You can also use the 32-bit autovec executable that is included
in Dan Mick's source archive (so no recompile is needed)
ftp://playground.sun.com/pub/dmick/autovec.tar.Z
To use the 32-bit executable, you have to boot your opteron box in
32-bit mode, by editing the kernel boot command line in grub:
add the 32-bit kernel name "kernel/unix " at the end of the
"kernel /platform/i86pc/multiboot " line.
Post by Dmitry
can i exlude acpica module?
in /etc/system
exclude: acpica
what consequences for me?
At the very minimum, you'll have only one CPU available.
Interrupts will probably not be correctly routed; some devices
will not work.
Post by Jürgen Keil
Hmm, your attempt to disable Solaris acpi with the
"-B acpi-user-options=0x2" option should have worked.
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/i86pc/io/acpica/acpica.c#acpica_init
acpica_init()
{
    ACPI_STATUS status;

    /*
     * Make sure user options are processed,
     * then fail to initialize if ACPI CA has been
     * disabled
     */
    acpica_process_user_options();
    if (!acpica_enable)
        return (AE_ERROR);
    ....
Are you 100% sure there weren't any typos with the
"acpi-user-options" parameter when you tested that?
It is perhaps possible the issue has nothing to do with acpica :-) ?

Dana
Jean-François Ndi
2006-07-16 17:49:17 UTC
Hi Dmitry.

Could you perform the following test?

- Boot -kvd
- At the debugger prompt
1) ::bp npe`_init
2) ::bp acpica`_init
3) :c

Which breakpoint do you hit first?

When you hit the npe breakpoint:
1) pcie_error_disable_flag/W 1
2) :c

When you hit the acpica breakpoint:
1) :c

Do you see the PCIE root nexus driver getting loaded (npe)?

Normally, the effect of 6401605 is an unconditional freeze (memory access at base address 0).

Could you please verify?

Thanks.

Regards,

J-F


Dmitry
2006-07-17 04:42:11 UTC
"npe" not present


Jean-François Ndi
2006-07-17 05:47:56 UTC
Hi Dmitry,
Post by Dmitry
"npe" not present
So as I expected 6401605 is not involved here.

Thanks,

back to my ACPI study ;-)

Regards,

J-F


Dmitry
2006-07-17 05:11:46 UTC
I don't understand this output :) (I ran it in 32-bit mode)
# ./autovec
[1]
i8042_intr() pri 0
[4]
<stale entry>
[6]
fdc_intr() pri 0
[9]
acpi_wrapper_isr() pri 0
[12]
i8042_intr() pri 0
[14]
ata_intr() pri 0
[18]
cpqary3_hw_isr() pri 0
[25]
bge_intr() pri 0
[32]
mpt_intr() pri 0
[33]
mpt_intr() pri 0
[36]
mpt_intr() pri 0
[37]
mpt_intr() pri 0
[192]
xc_serv() pri 0
[208]
kcpc_hw_overflow_intr() pri 0
[209]
cbe_fire() pri 0
[210]
cbe_fire() pri 0
[224]
xc_serv() pri 0
[225]
apic_error_intr() pri 0


Jürgen Keil
2006-07-17 08:27:31 UTC
Post by Dmitry
i don't understand it :) (i run it in 32-bit mode )
# ./autovec
...
Post by Dmitry
[9]
acpi_wrapper_isr() pri 0
So, my theory that the hang could be caused by shared interrupts is wrong;
there is exactly one interrupt handler registered for vector 9.
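
That check can be scripted. A small sketch, assuming autovec output in the shape posted above (a "[N]" line per vector, followed by one line per registered handler). handlers_on_vector is a name invented here, and "<stale entry>" lines are counted like any other handler line.

```shell
# handlers_on_vector VECTOR < autovec-output   (hypothetical helper)
# Count the handler lines listed under "[VECTOR]".
handlers_on_vector() {
    awk -v want="[$1]" '
        NF == 1 && $1 ~ /^\[[0-9]+\]$/ { cur = $1; next }  # new vector starts
        cur == want && NF > 0          { n++ }             # handler under it
        END                            { print n + 0 }
    '
}

# Example:
#   ./autovec | handlers_on_vector 9
```

A count above 1 for the acpi_wrapper_isr vector would have pointed at a shared interrupt.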

I also had a closer look at the acpica initialization code. The last message
that you see when running with "moddebug/W 80000000" ....

installing acpica, module id 11.


is printed immediately before we run the _init() function from the acpica
kernel module. jfndi is right: the code in the _init() function does not care
about "acpi-user-options".

The code in acpica`_init() also does not yet install interrupt handlers. That happens
a bit later.


Dmitry
2006-07-17 08:45:57 UTC
Can I use this method to debug acpica at boot?

http://blogs.sun.com/roller/page/danasblog?entry=solaris_acpi_ca_debug_output

I'll try it later today.


Jean-François Ndi
2006-07-17 09:17:45 UTC
As far as I can see, the problem seems to occur during the pci device tree creation.

Maybe adding:

set pci_autoconfig:pci_boot_debug=1

to your /etc/system will provide more info about when the hang occurs.

Regards,

J-F


Dmitry
2006-07-17 12:50:32 UTC
I tried setting:
set acpica:acpica_muzzle_debug_output=0
set acpica:AcpiDbgLevel=0x7
set acpica:acpica_console_out=1

Please see screenshot 3 in the first message.
Should I perhaps increase AcpiDbgLevel?


Jean-François Ndi
2006-07-17 13:21:07 UTC
From what I see in the screenshot the hang occurs during the pci bus probing.
I think it hangs in the following function:

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/i86pc/io/pci/pci_boot.c#enumerate_bus_devs

just after the call of:

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/i86pc/io/pci/pci_boot.c#process_devfunc

I am not sure yet.

Regards,

J-F


Dana H. Myers
2006-07-17 14:18:58 UTC
Post by Jean-François Ndi
From what I see in the screenshot the hang occurs during the pci bus probing.
I haven't seen a screenshot - did Dmitry send one? Could you forward it
to me?
Post by Jean-François Ndi
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/i86pc/io/pci/pci_boot.c#enumerate_bus_devs
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/i86pc/io/pci/pci_boot.c#process_devfunc
I am not sure yet.
This would tend to support my suspicion that we're tripping over a PCI bus
error or an incompletely configured PCI-PCI bridge.

Thanks -
Dana
Dmitry
2006-07-17 14:28:23 UTC
I don't know, I'm muzzy.
Now I'll try to boot
with different options:
reboot -- -B acpi-user-options=


Dana H. Myers
2006-07-17 15:34:06 UTC
Post by Dmitry
i don't now, i'm muzzy
now i'll try to boot
with difeerent options
reboot -- -B acpi-user-options=
I would suggest not changing the default for acpi-user-options; the
evidence indicates that the issue is during PCI enumeration.

When the system *does* boot, would you be so kind as to run:

/usr/X11/bin/scanpci -v

(as root) and send me the results?

Thanks -
Dana
Jürgen Keil
2006-07-17 13:28:39 UTC
Post by Dmitry
i try to set
set acpica:acpica_muzzle_debug_output=0
set acpica:AcpiDbgLevel=0x7
set acpica:acpica_console_out=1
please see screenshot 3 in the first message
may be increase AcpiDbgLevel?
I think AcpiDbgLevel=0x3fff is the default that is compiled into the acpica
module, and should print lots of information.

But... You also added "set pci_autoconfig:pci_boot_debug=1" ?

The "PCI Hot-Plug" and the "NOTICE" messages are not from the acpica code.
So, now it is looking more and more as if it isn't acpica that is responsible
for the hang, but the pci bus enumeration code.

Quote from Dana...
Post by Dana H. Myers
It is perhaps possible the issue has nothing to do with acpica :-) ?
Dana H. Myers
2006-07-17 14:13:08 UTC
Jürgen Keil wrote:
[...]
Post by Jürgen Keil
But... You also added "set pci_autoconfig:pci_boot_debug=1" ?
The "PCI Hot-Plug" and the "NOTICE messages are not from the acpica code.
So, now it is looking more and more as if acpica isn't responsible for the hang,
but the pci bus enumeration code.
Quote from Dana...
Post by Dana H. Myers
It is perhaps possible the issue has nothing to do with acpica :-) ?
Heh; I also wrote:

"Since acpica is loaded early, along with pci_autoconfig and the PSM
modules, it is apparently easy to fixate on acpica as a source of trouble
when there's plenty of other things going on in the system;"

I've been suspecting a PCI enumeration issue, and it made me think of
the PCI-e error handling problem. Apparently we do not have a precise
duplicate of CR 6401605, but I'm suspecting that PCI enumeration trips
over a bus error of some kind.

Dana
Jürgen Keil
2006-07-17 14:34:41 UTC
Post by Dana H. Myers
I've been suspecting a PCI enumeration issue, and it made me think of
the PCI-e error handling problem. Apparently we do not have a precise
duplicate of CR 6401605, but I'm suspecting that PCI enumeration trips
over a bus error of some kind.
Is the fix for CR 6401605 included in Solaris 10 06/06 x86 ?


James Carlson
2006-07-17 15:07:49 UTC
Post by Jürgen Keil
Post by Dana H. Myers
I've been suspecting a PCI enumeration issue, and it made me think of
the PCI-e error handling problem. Apparently we do not have a precise
duplicate of CR 6401605, but I'm suspecting that PCI enumeration trips
over a bus error of some kind.
Is the fix for CR 6401605 included in Solaris 10 06/06 x86 ?
No. It looks like the fix is in progress for s10u3 (the next update).
--
James Carlson, KISS Network <james.d.carlson-xsfywfwIY+***@public.gmane.org>
Sun Microsystems / 1 Network Drive 71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
Rob McMahon
2006-07-17 15:03:14 UTC
For another datapoint, I'm seeing what looks like exactly the same problem on a DL385 (single AMD Opteron 280). Setting the appropriate flags in /etc/system, the hang ends with:

Search PCI Hot-Plug Resource Table starting at 0xF0000
Found PCI Hot-Plug Resource Table at f4ee0
No. of PCI hot-plug slot entries = 0x0
Found MP Floating Pointer Structure at f4fa0
NOTICE: enumerating pci bus 0x0
NOTICE: probing dev 0x0, func 0x0
NOTICE: probing dev 0x1, func 0x0
NOTICE: probing dev 0x2, func 0x0
NOTICE: probing dev 0x3, func 0x0
NOTICE: bus 1 io-range: 0x4000-4fff
NOTICE: bus 1 mem-range: 0xf5f00000-f7dfffff

(Copied to paper, and typed in by hand since I'm struggling to get the machine back.)


Rob McMahon
2006-07-17 15:07:08 UTC
Ah, I should have said this is Solaris 10U2, not the community or express editions.


Dana H. Myers
2006-07-17 15:32:51 UTC
Post by Rob McMahon
Search PCI Hot-Plug Resource Table starting at 0xF0000
Found PCI Hot-Plug Resource Table at f4ee0
No. of PCI hot-plug slot entries = 0x0
Found MP Floating Pointer Structure at f4fa0
NOTICE: enumerating bug 0x0
NOTICE: probing dev 0x0, func 0x0
NOTICE: probing dev 0x1, func 0x0
NOTICE: probing dev 0x2, func 0x0
NOTICE: probing dev 0x3, func 0x0
NOTICE: bus 1 io-tange: 0x4000-4fff
NOTICE: bus 1 mem-range 0xf5f00000-f7dfffff
(Copied to paper, and typed in by hand since I'm struggling to get the machine back.)
Ah. This does indeed look like Dmitry's issue.

According to the HCL, Solaris has been tested on the DL585 at least
once; this makes me curious what BIOS revision these two machines
(DL385 and DL585) are running. I believe you can get the BIOS version
by entering BIOS setup-mode.

Thanks -
Dana
Dana H. Myers
2006-07-17 15:57:42 UTC
Post by Rob McMahon
Post by Dana H. Myers
According to the HCL, Solaris has been tested on the DL585 at least
once; this makes me curious what BIOS revision these two machines
(DL385 and DL585) are running. I believe you can get the BIOS version
by entering BIOS setup-mode.
I believe the initial BIOS was 12/12/05. When I started having the
problems, I updated it to
HP ProLiant DL385 system ROM A05 (03/01/2006)
(By which I suspect it means 1st March, rather than 3rd January.)
I'm pretty sure it's in Yank format ;-) (Living a year in the UK
has left me writing '1 Mar 2006' instead).
Post by Rob McMahon
I saw no difference. BTW, now that I've got it back, the sequence logged
on a successful reboot is
Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 536860 kern.notice] search
PCI Hot-Plug Resource Table starting at 0xF0000
Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 288500 kern.notice] Found
PCI Hot-Plug Resource Table at f4ee0
Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 289203 kern.notice] No. of
PCI hot-plug slot entries = 0x0
Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 165270 kern.notice] Found
MP Floating Pointer Structure at f4fa0
enumerating pci bus 0x0
probing dev 0x0, func 0x0
probing dev 0x1, func 0x0
probing dev 0x2, func 0x0
probing dev 0x3, func 0x0
bus 1 io-range: 0x4000-4fff
bus 1 mem-range: 0xf5f00000-f7dfffff
<<< this is where it hangs when it fails >>>
probing dev 0x4, func 0x0
Could you run '/usr/X11/bin/scanpci -v' and send me the output? I suspect
bus 0, dev 3 is a PCI-PCI bridge and the hang occurs during enumeration
of the bridge.
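
To pick the device in question out of saved scanpci output, something like the following might do. It is only a sketch: it assumes header lines of the form "pci bus 0x0000 cardnum 0x03 function 0x00: vendor 0x1022 device 0x7460", which is the shape of the one scanpci line quoted in this thread, and pci_ids is a hypothetical helper name.

```shell
# pci_ids BUS CARDNUM < scanpci-output   (hypothetical helper)
# Print "vendor-id device-id" for every function found at BUS/CARDNUM.
# BUS and CARDNUM must be written exactly as scanpci prints them.
pci_ids() {
    grep "^pci bus $1 cardnum $2 function " |
        sed 's/.*vendor \(0x[0-9a-fA-F]*\) device \(0x[0-9a-fA-F]*\).*/\1 \2/'
}

# Example, for bus 0, dev 3:
#   /usr/X11/bin/scanpci -v | pci_ids 0x0000 0x03
```

The IDs can then be looked up to see whether the device really is a PCI-PCI bridge.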

Thanks -
Dana
Dmitry
2006-07-17 16:13:37 UTC
Yes, it's the same as mine.
I have this bug on both a DL585 and a DL385 (I have both machines).
Both have the latest firmware:
DL585 (A01) Servers version 2006.03.22 A (13 Apr 06)
DL385 (A05) Servers version 2006.03.01 (28 Mar 06)

Tomorrow morning I'll continue testing, but I don't have much time.

I trust you won't leave me alone with my problem.
Thanks to all.


Rob McMahon
2006-07-17 15:52:04 UTC
Post by Dana H. Myers
According to the HCL, Solaris has been testing on the DL585 at least
once; this makes me curious what BIOS revision these two machines
(DL385 and DL585) are running. I believe you can get the BIOS version
by entering BIOS setup-mode.
I believe the initial BIOS was 12/12/05. When I started having the
problems, I updated it to

HP ProLiant DL385 system ROM A05 (03/01/2006)

(By which I suspect it means 1st March, rather than 3rd January.) I
saw no difference. BTW, now that I've got it back, the sequence logged
on a successful reboot is

Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 536860 kern.notice] search
PCI Hot-Plug Resource Table starting at 0xF0000
Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 288500 kern.notice] Found
PCI Hot-Plug Resource Table at f4ee0
Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 289203 kern.notice] No. of
PCI hot-plug slot entries = 0x0
Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 165270 kern.notice] Found
MP Floating Pointer Structure at f4fa0
Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 732373 kern.notice] NOTICE:
enumerating pci bus 0x0
Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 496449 kern.notice] NOTICE:
probing dev 0x0, func 0x0
Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 496449 kern.notice] NOTICE:
probing dev 0x1, func 0x0
Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 496449 kern.notice] NOTICE:
probing dev 0x2, func 0x0
Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 496449 kern.notice] NOTICE:
probing dev 0x3, func 0x0
Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 576968 kern.notice] NOTICE:
bus 1 io-range: 0x4000-4fff
Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 166801 kern.notice] NOTICE:
bus 1 mem-range: 0xf5f00000-f7dfffff
<<< this is where it hangs when it fails >>>
Jul 17 16:09:33 ux-019-2 pci_autoconfig: [ID 496449 kern.notice] NOTICE:
probing dev 0x4, func 0x0
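
When comparing a hung boot with a good one, it can help to pull out just the last device probed. A rough sketch, assuming the console or log text contains "probing dev 0x.., func 0x.." entries as in the excerpt above; last_probe is a hypothetical helper name.

```shell
# last_probe < console-or-log-text   (hypothetical helper)
# Print the last "probing dev ..., func ..." entry seen in the input.
last_probe() {
    sed -n 's/.*\(probing dev 0x[0-9a-fA-F]*, func 0x[0-9a-fA-F]*\).*/\1/p' |
        tail -n 1
}

# Example:
#   last_probe < hung-boot.txt
```

On the hung transcript in this thread the last entry is dev 0x3, func 0x0, whereas the good boot goes on to probe dev 0x4.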

Anything I can do to help resolve this problem, please let me know.

Cheers,

Rob
--
E-Mail: Rob.McMahon-***@public.gmane.org PHONE: +44 24 7652 3037
Rob McMahon, IT Services, Warwick University, Coventry, CV4 7AL, England
Dana H. Myers
2006-07-19 02:13:31 UTC
How much memory do you have, and how often does the machine hang on boot?

I've got a DL585 running S10U2 here now, but it only has 2GB of RAM.
I've rebooted a few times without a hang.

Dana
Dmitry
2006-07-19 04:37:37 UTC
32 GB RAM, 4 x dual-core Opteron

Often: every time after a reboot (reboot, init 6),
sometimes on power-on.


Rob McMahon
2006-07-19 09:31:56 UTC
Post by Dana H. Myers
How much memory do you have, and how often does the machine hang on boot?
The machines I'm suffering with all have 3GB memory. I would say they
fail one time in three, or thereabouts. It depends whether I want them
to fail or not :-) Sometimes one of them seems to be in a good mood,
and you have to move onto the next to trap a failure. I'd say it was
irrelevant whether they were rebooted with init 6 or power-cycled: I've
seen failures both ways. I've also seen a hang while PXE booting, but
only once (but there again I haven't done it that often).
Post by Dana H. Myers
I've got a DL585 running S10U2 here now, but it only has 2GB of RAM.
I've rebooted a few times without a hang.
Rats. I've 5 DL385s and 1 DL380 with one or two AMD Opteron 280s
(385s) or two Intel Xeons (380) all failing. They all have Smart Array
6i controllers with 3-6 disks: I wonder if this could be contributing?

Cheers,

Rob
--
E-Mail: Rob.McMahon-***@public.gmane.org PHONE: +44 24 7652 3037
Rob McMahon, IT Services, Warwick University, Coventry, CV4 7AL, England
Dmitry
2006-07-18 17:40:49 UTC
Permalink
Today the DL585 hung during normal operation,
while it was only copying one big file from the network.


This message posted from opensolaris.org
Dana H. Myers
2006-07-18 17:52:04 UTC
Permalink
Post by Dmitry
Today the DL585 hung during normal operation,
while it was only copying one big file from the network.
So, the system hangs not only during boot, but sometimes
long after boot?
Jean-François Ndi
2006-07-19 06:19:44 UTC
Permalink
Hello, Dana.

This problem seems to be solved under Nevada (MASTER_ABORT followed by an NMI in short).

pci_enumerate() and process_devfunc() have been modified in order to fix certain PCI devices (in fact, one) before the enumeration.

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/i86pc/io/pci/pci_boot.c#pci_fix_amd8111:

Surprise... This is exactly the one that seems to cause the problem.

pci bus 0x0000 cardnum 0x03 function 0x00: vendor 0x1022 device 0x7460
Advanced Micro Devices [AMD] AMD-8111 PCI

Am I wrong?

Regards,

J-F


This message posted from opensolaris.org
Jean-François Ndi
2006-07-19 11:12:09 UTC
Permalink
Post by Jean-François Ndi
pci bus 0x0000 cardnum 0x03 function 0x00: vendor
0x1022 device 0x7460
Advanced Micro Devices [AMD] AMD-8111 PCI
It seems that I am wrong: this is not exactly the same device. It may be a similar problem; I don't know.

Regards,

J-F


This message posted from opensolaris.org
Dana H. Myers
2006-07-19 17:51:01 UTC
Permalink
Post by Jean-François Ndi
Hello, Dana.
This problem seems to be solved under Nevada (MASTER_ABORT followed by an NMI in short).
pci_enumerate() and process_devfunc() have been modified in order to fix certain PCI devices (in fact, one) before the enumeration.
Surprise... This is exactly the one that seems to cause the problem.
pci bus 0x0000 cardnum 0x03 function 0x00: vendor 0x1022 device 0x7460
Advanced Micro Devices [AMD] AMD-8111 PCI
Am I wrong?
Well, it's possible it is a manifestation of the same issue, though the
symptoms are quite different. In the above fix, some revisions of the system
BIOS were incorrectly corrupting the PCI config space access register at 0xCF8
as a result of handling the PCI master abort. This would cause a device to
appear to go missing from config space. We're seeing a semi-random system hang here,
which is different - but it could be a result of the HP BIOS being different.

However, I'm not strongly suspecting this is the case on the HP machines.
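
For readers unfamiliar with the register at 0xCF8 mentioned above, here is a minimal sketch of how the standard PCI configuration address (mechanism #1) is composed. It is only an illustration, not code from Solaris or from any BIOS; the function names are made up for this example. It shows why a corrupted value in that register can make a device appear to go missing: the bus/device/function fields simply select a different target.

```python
# Illustrative only: layout of the 32-bit value written to I/O port 0xCF8
# (PCI configuration mechanism #1). Function names are hypothetical.

CONFIG_ADDRESS_PORT = 0xCF8   # address port; data is then read/written at 0xCFC
ENABLE_BIT = 0x80000000       # bit 31: enable configuration cycle


def pci_config_address(bus: int, dev: int, func: int, reg: int) -> int:
    """Compose the address word for a given bus/device/function/register."""
    if not (0 <= bus <= 0xFF and 0 <= dev <= 0x1F
            and 0 <= func <= 0x07 and 0 <= reg <= 0xFF):
        raise ValueError("bus/dev/func/reg out of range")
    # The register offset is dword-aligned, so the low two bits are dropped.
    return ENABLE_BIT | (bus << 16) | (dev << 11) | (func << 8) | (reg & 0xFC)


def decode_pci_config_address(value: int) -> dict:
    """Split an address word back into its fields."""
    return {
        "enabled": bool(value & ENABLE_BIT),
        "bus": (value >> 16) & 0xFF,
        "dev": (value >> 11) & 0x1F,
        "func": (value >> 8) & 0x07,
        "reg": value & 0xFC,
    }


if __name__ == "__main__":
    # bus 0, device 3, function 0: the location quoted earlier in this thread
    addr = pci_config_address(0, 0x03, 0, 0x00)
    print(hex(addr))
    # Flip a single bit in the device field and see what gets addressed instead
    print(decode_pci_config_address(addr ^ 0x00000800))
```

If firmware leaves a different value in this register than the one the OS wrote, the next data-port access reaches another device (or nothing at all), which matches the "device goes missing" symptom described for the earlier fix.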
Jean-François Ndi
2006-07-18 17:52:43 UTC
Permalink
Hi Dmitry.

Have you run the scanpci that Dana asked for?

I would be interested to have a look at it.

Best regards,

J-F


This message posted from opensolaris.org
Dmitry
2006-07-18 19:02:07 UTC
Permalink
It's a lot of text; I attached it to the first message.


This message posted from opensolaris.org
Dmitry
2006-07-19 05:42:21 UTC
Permalink
Now I see this on hp.com:

software and drivers
HP ProLiant DL585 Server, Sun Solaris 10 x86 Platform Edition
HP System Management Homepage for Solaris (x86) 10 > 2.0.0 [b]new 18 Jul 06[/b]

So it should work
???


This message posted from opensolaris.org
Dmitry
2006-07-20 07:38:04 UTC
Permalink
I'm disappointed.
I'll change the system disks and install Windows 2003 x64; if it works well, I'll take leave of Solaris :)
Maybe it is too early for production.


This message posted from opensolaris.org
Gino Ruopolo
2006-07-20 08:14:06 UTC
Permalink
Same problems here. DL585 with the latest firmware.
We have 14 units to upgrade to U2 as we need ZFS.
Now testing Linux ... :(


This message posted from opensolaris.org
Vlad
2006-07-25 00:08:33 UTC
Permalink
I have a DL385. After upgrading to Solaris 10 6/06, my server began to hang after "init 6" or reboot. Every time after a reboot I could see a message from the RAID controller that some “data” is there and it needs to be pushed ... Crap!!.. Powering off always cleans that cache. I installed new BIOS firmware (BIOS Configuration: HP A05 03/01/2006) from HP, and that helped only a little: my server still hung every third time. But I saw that before the hang, the RAID controller tried to access the HD and then stopped. I had a little info in my log, like “.. genunix: [ID 935449 kern.notice] ATA DMA off: disabled by prop ata-dma-enabled ..”

So I added ata-dma-enabled=0 to the GRUB kernel line, and now it works.
I have rebooted 9 times without a hang.

kernel /platform/i86pc/multiboot -v -B ata-dma-enabled=0

It means that the HP DL385 has a major ACPI fuc* up.


This message posted from opensolaris.org
Dana H. Myers
2006-07-25 05:15:00 UTC
Permalink
Post by Vlad
I have a DL385. After upgrading to Solaris 10 6/06, my server began to hang after "init 6" or reboot.
This is a known problem, for which I am preparing to integrate a fix into Solaris Nevada
(and into the next possible S10 update).

The hang is caused when, during PCI enumeration, a PCI-PCI bridge is partially
disabled when the PCI command register bits which enable IO and memory windows
are cleared.
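
As a rough illustration of the register bits involved (not the actual Solaris enumeration code, and not the fix being integrated), the sketch below models the standard PCI command register at config offset 0x04. The bit positions come from the PCI specification; the helper names are hypothetical. On a PCI-PCI bridge, clearing the I/O and memory enable bits stops the bridge from responding to those transactions on its primary side, so devices behind it become unreachable until the bits are restored.

```python
# Illustrative only: standard PCI command register bits (config offset 0x04).
# Helper names are hypothetical and do not come from the Solaris source.

PCI_COMMAND_OFFSET = 0x04
PCI_COMMAND_IO = 0x0001      # bit 0: respond to I/O space accesses
PCI_COMMAND_MEMORY = 0x0002  # bit 1: respond to memory space accesses
PCI_COMMAND_MASTER = 0x0004  # bit 2: bus master enable


def disable_decoding(command: int) -> int:
    """Return the command value with I/O and memory decoding turned off."""
    return command & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY) & 0xFFFF


def windows_enabled(command: int) -> dict:
    """Report which address-space windows the device or bridge still decodes."""
    return {
        "io": bool(command & PCI_COMMAND_IO),
        "memory": bool(command & PCI_COMMAND_MEMORY),
    }


if __name__ == "__main__":
    original = PCI_COMMAND_IO | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER
    cleared = disable_decoding(original)
    print(hex(original), windows_enabled(original))
    print(hex(cleared), windows_enabled(cleared))
```

While a bridge sits in the cleared state, any access that has to pass through it to reach a downstream device (a console, a management processor, a disk controller) has nowhere to go, which is one plausible way a probe sequence could turn into a hang.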
Post by Vlad
Every time after a reboot I could see a message from the RAID controller that some “data” is there
and it needs to be pushed ... Crap!!.. Powering off always cleans that cache. I installed
new BIOS firmware (BIOS Configuration: HP A05 03/01/2006) from HP, and that helped only
a little: my server still hung every third time.
But I saw that before the hang, the RAID controller tried to access the HD and then stopped.
I had a little info in my log, like
“.. genunix: [ID 935449 kern.notice] ATA DMA off: disabled by prop ata-dma-enabled ..”
I believe the BIOS message about pushing data to the disk is innocuous.
Post by Vlad
So I added ata-dma-enabled=0 to the GRUB kernel line, and now it works.
I have rebooted 9 times without a hang.
kernel /platform/i86pc/multiboot -v -B ata-dma-enabled=0
It means that the HP DL385 has a major ACPI fuc* up.
While this is interesting, why do you conclude this has anything to do with ACPI?

In any case, my experience has been, after 1000+ reboots, that the RAID controller
pushes the data and Solaris is not impacted at all.

Dana
