Yep, dmesg was what helped me track it down On 6/27/2023 11:12 AM, Andreas Legner via IBM TechXchange Community wrote:
01000188fda0de26-cad81b4e-ed1a-45b7-ae67-0a0ba2666153-000000@email.amazonses.com"> oninit stacks, I'd say, in this case actually only a (sleeping) Informix thread's stack aren't that helpful. All you can glean from them is...
Re: kernel: INFO: task oninit:1720 blocked for more than 120 seconds. | | | oninit stacks, I'd say, in this case actually only a (sleeping) Informix thread's stack aren't that helpful. All you can glean from them is Informix is waiting on some IO, more specifically on an IO request submitted to the OS kernel. The actual problem occurs in the OS kernel, with an OS IO worker process/thread processing our request and being hung interacting with the file system. These kernel stacks, afair, are contained in the dmesg output.
------------------------------ Andreas Legner ------------------------------ | | Reply to Group Online View Thread Recommend Forward Flag as Inappropriate |
Original Message: Sent: Tue June 27, 2023 12:04 PM | |
| |
Original Message:
Sent: 6/27/2023 12:13:00 PM
From: Andreas Legner
Subject: RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
oninit stacks, I'd say, in this case actually only a (sleeping) Informix thread's stack aren't that helpful. All you can glean from them is Informix is waiting on some IO, more specifically on an IO request submitted to the OS kernel.
The actual problem occurs in the OS kernel, with an OS IO worker process/thread processing our request and being hung interacting with the file system. These kernel stacks, afair, are contained in the dmesg output.
------------------------------
Andreas Legner
------------------------------
Original Message:
Sent: Tue June 27, 2023 12:04 PM
From: Dennis Vinueza
Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
Hi, yes, we tried that already. pstack does not return anything, but onstat -g stk returns something very similar as from the other post. But, as said, journalling is on. Any other ext4 settings I might be missing?
Stack for thread: 13 flush_sub(0)
base: 0x00000000cbfa7000
len: 69632
pc: 0x00000000014f95db
tos: 0x00000000cbfb7c10
state: sleeping
vp: 1
0x00000000014f95db (oninit) yield_processor_mvp
0x00000000014b88d9 (oninit) mt_aio_wait.part.9
0x00000000014af63e (oninit) mt_aio_start
0x0000000000eab72a (oninit) aclean_chunk
0x0000000000eaab17 (oninit) flush_sub
0x000000000150e5cf (oninit) startup
dumpe2fs /dev/sdb1 | grep journal
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
------------------------------
Dennis Vinueza
Original Message:
Sent: Tue June 27, 2023 11:49 AM
From: Paul Watson
Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
when it is hung track down the oninit process and look at the stack - might help
https://community.ibm.com/community/user/datamanagement/discussion/nasty-redhat-issue#bmd84e1b8a-caba-4140-ae34-722118dc650a
Cheers
Paul
On 6/27/2023 10:35 AM, Dennis Vinueza via IBM TechXchange Community wrote:
01000188fd7efd43-ebaef315-11ae-4b4b-b23e-b4e800ae862b-000000@email.amazonses.com"> At least that does not seem to be the only way to reproduce the issue. ext4 with journalling is now active but the problem still occurs after some... -posted to the "Informix" group
Original Message:
Sent: 6/27/2023 11:36:00 AM
From: Dennis Vinueza
Subject: RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
At least that does not seem to be the only way to reproduce the issue. ext4 with journalling is now active but the problem still occurs after some time.
------------------------------
Dennis Vinueza
Original Message:
Sent: Tue June 27, 2023 11:30 AM
From: Andreas Legner
Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
I was under the impression that ext4 without journalling is the (unsupported) combination that's causing these hangs - is that not the case?
------------------------------
Andreas Legner
Original Message:
Sent: Tue June 27, 2023 06:59 AM
From: Dennis Vinueza
Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
Hello everyone,
did anyone ever find out whether this problem is caused bei IDS or the OS? Having journaling enabled or not on OS level did not make any difference, for example.
@Hrvoje Zokovic had you had any problems of this kind ever since switching off DIRECT_IO ?
KR
Dennis
------------------------------
Dennis Vinueza
Original Message:
Sent: Sun November 06, 2022 05:09 AM
From: Hrvoje Zokovic
Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
Thanks David,
customer did not notice any performance degradation after turning DIRECT_IO off - so we'll leave it turned off.
I'm leaving this discussion to Informix and Red Hat owners (or owner) to deal or not to deal wit it.
But I'll have this remembered for future use :)
Ty
Hrvoje
------------------------------
Hrvoje Zokovic
Original Message:
Sent: Sat November 05, 2022 07:46 PM
From: David Williams
Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
Hi,
io_submit can block for many reasons -
https://stackoverflow.com/questions/34572559/asynchronous-io-io-submit-latency-in-ubuntu-linuxIn this case it appears to be an unaligned write e.g. writing to the 2nd 2K Informix page on a 4K ext4 block.
https://elixir.bootlin.com/linux/v4.18/source/fs/ext4/file.c#L247:* Unaligned direct AIO must be serialized among each other as zeroing
* of partial blocks of two competing unaligned AIOs can result in data
* corruption."
https://elixir.bootlin.com/linux/v4.18/source/fs/ext4/file.c#L106 ext4_unwritten_wait
This the calls schedule which then invokes the block unplugging mechanism
https://lwn.net/Articles/438256/i.e. have the process action all pending i/o's!
Looks like the device took 120s to action all the pending i/os.
From the comments it is unclear if it needs to submit or action the i/os!
You could have been hitting the limits in
https://stackoverflow.com/questions/34572559/asynchronous-io-io-submit-latency-in-ubuntu-linux item 4
/sys/block/[disk]/queue/nr_request
/sys/block/[disk]/device/queue_depth
or something else.
It is something for however supports your OS to look into.
You can also try the stuff in
https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/and use
- sudo psn -G syscall,wchan,kstack | sed 's/->/\n/g'
-
0x.tools / run_xcpu.sh runs perf record / sudo perf report -i
to get thread stack and see exactly where iin the kernel the time is being spent!
David.
------------------------------
David Williams
Original Message:
Sent: Tue March 08, 2022 04:14 AM
From: Hrvoje Zokovic
Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
Anyone has experience/solution with this:
Feb 14 14:28:46 dbpri kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
Feb 14 14:28:46 dbpri kernel: Not tainted 4.18.0-348.12.2.el8_5.x86_64 #1
Feb 14 14:28:46 dbpri kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 14 14:28:46 dbpri kernel: task:oninit state:D stack: 0 pid: 1720 ppid: 1714 flags:0x00000084
Feb 14 14:28:46 dbpri kernel: Call Trace:
Feb 14 14:28:46 dbpri kernel: __schedule+0x2bd/0x760
Feb 14 14:28:46 dbpri kernel: ? blk_flush_plug_list+0xd7/0x100
Feb 14 14:28:46 dbpri kernel: schedule+0x37/0xa0
Feb 14 14:28:46 dbpri kernel: ext4_unwritten_wait+0x83/0xb0 [ext4]
Feb 14 14:28:46 dbpri kernel: ? finish_wait+0x80/0x80
Feb 14 14:28:46 dbpri kernel: ext4_file_write_iter+0x392/0x3e0 [ext4]
Feb 14 14:28:46 dbpri kernel: aio_write+0xf6/0x1c0
Feb 14 14:28:46 dbpri kernel: ? blk_mq_sched_insert_requests+0x71/0xf0
Feb 14 14:28:46 dbpri kernel: ? io_submit_one+0x7e/0x3c0
Feb 14 14:28:46 dbpri kernel: ? kmem_cache_alloc+0x12e/0x270
Feb 14 14:28:46 dbpri kernel: io_submit_one+0x131/0x3c0
Feb 14 14:28:46 dbpri kernel: __x64_sys_io_submit+0xa2/0x180
Feb 14 14:28:46 dbpri kernel: ? syscall_trace_enter+0x1d3/0x2c0
Feb 14 14:28:46 dbpri kernel: do_syscall_64+0x5b/0x1a0
Feb 14 14:28:46 dbpri kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
Feb 14 14:28:46 dbpri kernel: RIP: 0033:0x7fbb9d46652d
Feb 14 14:28:46 dbpri kernel: Code: Unable to access opcode bytes at RIP 0x7fbb9d466503.
Feb 14 14:28:46 dbpri kernel: RSP: 002b:00000000500d7da8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
Feb 14 14:28:46 dbpri kernel: RAX: ffffffffffffffda RBX: 00007fbb9ecf7280 RCX: 00007fbb9d46652d
Feb 14 14:28:46 dbpri kernel: RDX: 00000000500d7de8 RSI: 0000000000000001 RDI: 00007fbb98207000
Feb 14 14:28:46 dbpri kernel: RBP: 00007fbb98207000 R08: 0000000002229800 R09: 00000000538288d0
Feb 14 14:28:46 dbpri kernel: R10: ffffffffffffffe0 R11: 0000000000000246 R12: 0000000000000001
Feb 14 14:28:46 dbpri kernel: R13: 0000000000000000 R14: 00000000500d7de8 R15: 0000000053828850
It is Red Hat Enterprise Linux 8.5 (Ootpa)
Linux dbpri 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Mon Jan 17 07:06:06 EST 2022 x86_64 x86_64 x86_64 GNU/Linux
I opened support case but no answer yet
Ofc I do not want to disable this message :)
Hrvoje
------------------------------
Hrvoje Zokovic
------------------------------
#Informix