Informix

 View Only
Expand all | Collapse all

kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

  • 1.  kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    IBM Champion
    Posted Tue March 08, 2022 04:15 AM
    Edited by System Fri January 20, 2023 04:44 PM
    Anyone has experience/solution with this:
    Feb 14 14:28:46 dbpri kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
    Feb 14 14:28:46 dbpri kernel:   Not tainted 4.18.0-348.12.2.el8_5.x86_64 #1
    Feb 14 14:28:46 dbpri kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Feb 14 14:28:46 dbpri kernel: task:oninit     state:D stack:  0 pid: 1720 ppid: 1714 flags:0x00000084
    Feb 14 14:28:46 dbpri kernel: Call Trace:
    Feb 14 14:28:46 dbpri kernel: __schedule+0x2bd/0x760
    Feb 14 14:28:46 dbpri kernel: ? blk_flush_plug_list+0xd7/0x100
    Feb 14 14:28:46 dbpri kernel: schedule+0x37/0xa0
    Feb 14 14:28:46 dbpri kernel: ext4_unwritten_wait+0x83/0xb0 [ext4]
    Feb 14 14:28:46 dbpri kernel: ? finish_wait+0x80/0x80
    Feb 14 14:28:46 dbpri kernel: ext4_file_write_iter+0x392/0x3e0 [ext4]
    Feb 14 14:28:46 dbpri kernel: aio_write+0xf6/0x1c0
    Feb 14 14:28:46 dbpri kernel: ? blk_mq_sched_insert_requests+0x71/0xf0
    Feb 14 14:28:46 dbpri kernel: ? io_submit_one+0x7e/0x3c0
    Feb 14 14:28:46 dbpri kernel: ? kmem_cache_alloc+0x12e/0x270
    Feb 14 14:28:46 dbpri kernel: io_submit_one+0x131/0x3c0
    Feb 14 14:28:46 dbpri kernel: __x64_sys_io_submit+0xa2/0x180
    Feb 14 14:28:46 dbpri kernel: ? syscall_trace_enter+0x1d3/0x2c0
    Feb 14 14:28:46 dbpri kernel: do_syscall_64+0x5b/0x1a0
    Feb 14 14:28:46 dbpri kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
    Feb 14 14:28:46 dbpri kernel: RIP: 0033:0x7fbb9d46652d
    Feb 14 14:28:46 dbpri kernel: Code: Unable to access opcode bytes at RIP 0x7fbb9d466503.
    Feb 14 14:28:46 dbpri kernel: RSP: 002b:00000000500d7da8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
    Feb 14 14:28:46 dbpri kernel: RAX: ffffffffffffffda RBX: 00007fbb9ecf7280 RCX: 00007fbb9d46652d
    Feb 14 14:28:46 dbpri kernel: RDX: 00000000500d7de8 RSI: 0000000000000001 RDI: 00007fbb98207000
    Feb 14 14:28:46 dbpri kernel: RBP: 00007fbb98207000 R08: 0000000002229800 R09: 00000000538288d0
    Feb 14 14:28:46 dbpri kernel: R10: ffffffffffffffe0 R11: 0000000000000246 R12: 0000000000000001
    Feb 14 14:28:46 dbpri kernel: R13: 0000000000000000 R14: 00000000500d7de8 R15: 0000000053828850

    It is Red Hat Enterprise Linux 8.5 (Ootpa)
    Linux dbpri 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Mon Jan 17 07:06:06 EST 2022 x86_64 x86_64 x86_64 GNU/Linux

    I opened support case but no answer yet
    Ofc I do not want to disable this message :)

    Hrvoje

    ------------------------------
    Hrvoje Zokovic
    ------------------------------
    #Informix


  • 2.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Posted Fri October 28, 2022 09:03 AM
    Hello Hrvoje Zokovic,

    I am also facing the same issue. Our Informix instance was in hung state, twice in last 24 Hrs and same message in OS log file.

    Current Informix version is IBM Informix Dynamic Server Version 14.10.FC8WE and 
    Os Version
    Linux version 4.18.0-372.26.1.el8_6.x86_64 (mockbuild@x86-vm-08.build.eng.bos.redhat.com) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-10) (GCC)) #1 SMP Sat Aug 27 02:44:20 EDT 2022


    Did you find any solution of this error?


    regards
    Tarun Batra

    ------------------------------
    Tarun Batra
    ------------------------------



  • 3.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    IBM Champion
    Posted Fri October 28, 2022 09:14 AM
    I'd say Informix submitting a kaio request to the OS, and this submission library call hung in the kernel - probably a problem on OS level.
    Would this ever recover by itself, or is it hung forever and requires an Informix or even OS reboot?

    ------------------------------
    Andreas Legner
    ------------------------------



  • 4.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Posted Fri October 28, 2022 09:55 AM
    It was hung forever and we rebooted the machine

    Tarun Batra

    ------------------------------
    Tarun Batra
    ------------------------------



  • 5.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    IBM Champion
    Posted Fri October 28, 2022 09:32 AM
    Hi
    we turned off DIRECT_IO and since then no problem.
    This was support conclusion:

    I might be a good idea to try with some other DIRECT_IO setting.

    Especially when you see issue more often.

    But also would be interesting to get an information from the OS support if they can find out what is blocking oninit.

    From an Informix side we can check stacks of the cpu VP with the running KAIO. With pstack/procstack.

    But very likely only confirming that is hanging in an IO request.


    HTH
    Hrvoje

    ------------------------------
    Hrvoje Zokovic
    ------------------------------



  • 6.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    IBM Champion
    Posted Fri October 28, 2022 09:34 AM
    Hmm, maybe the company that owns RedHat should look into this problem with KAIO? Hmm, who is that?

    Oh, wait, it's IBM!

    ------------------------------
    Art S. Kagel, President and Principal Consultant
    ASK Database Management Corp.
    www.askdbmgt.com
    ------------------------------



  • 7.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    IBM Champion
    Posted Fri October 28, 2022 09:38 AM
    Yes Art, at that "get an information from the OS support" moment I surrendered :)
    Hrvoje


    ------------------------------
    Hrvoje Zokovic
    ------------------------------



  • 8.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    IBM Champion
    Posted Sat November 05, 2022 07:46 PM

    Hi,

    io_submit can block for many reasons - https://stackoverflow.com/questions/34572559/asynchronous-io-io-submit-latency-in-ubuntu-linux

    In this case it appears to be an unaligned write e.g. writing to the 2nd 2K Informix page on a 4K ext4 block.

    https://elixir.bootlin.com/linux/v4.18/source/fs/ext4/file.c#L247

    :* Unaligned direct AIO must be serialized among each other as zeroing
    * of partial blocks of two competing unaligned AIOs can result in data
    * corruption."

    https://elixir.bootlin.com/linux/v4.18/source/fs/ext4/file.c#L106  ext4_unwritten_wait

    This the calls schedule which then invokes the block unplugging mechanism
    https://lwn.net/Articles/438256/

    i.e. have the process  action all pending i/o's!

    Looks like the device took 120s to action all the pending i/os.

    From the comments it is unclear if it needs to submit or action the i/os!

    You could have been hitting the limits in https://stackoverflow.com/questions/34572559/asynchronous-io-io-submit-latency-in-ubuntu-linux item 4

    /sys/block/[disk]/queue/nr_request
    /sys/block/[disk]/device/queue_depth

    or something else.

    It is something for however supports your OS to look into.

    You can also try the stuff in
    https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/

    and use
    - sudo psn -G syscall,wchan,kstack | sed 's/->/\n/g'
    - 0x.tools / run_xcpu.sh runs perf record  / sudo perf report -i

    to get thread stack and see exactly where iin the kernel the time is being spent!

    David.

    ------------------------------
    David Williams
    ------------------------------



  • 9.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    IBM Champion
    Posted Sun November 06, 2022 05:09 AM

    Thanks David,
    customer did not notice any performance degradation after turning DIRECT_IO off - so we'll leave it turned off.
    I'm leaving this discussion to Informix and Red Hat owners (or owner) to deal or not to deal wit it.
    But I'll have this remembered for future use :)

    Ty
    Hrvoje



    ------------------------------
    Hrvoje Zokovic
    ------------------------------



  • 10.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Posted Tue June 27, 2023 07:00 AM

    Hello everyone,

    did anyone ever find out whether this problem is caused bei IDS or the OS? Having journaling enabled or not on OS level did not make any difference, for example.

    @Hrvoje Zokovic had you had any problems of this kind ever since switching off DIRECT_IO ?

    KR
    Dennis



    ------------------------------
    Dennis Vinueza
    ------------------------------



  • 11.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    IBM Champion
    Posted Tue June 27, 2023 11:30 AM

    I was under the impression that ext4 without journalling is the (unsupported) combination that's causing these hangs - is that not the case?



    ------------------------------
    Andreas Legner
    ------------------------------



  • 12.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Posted Tue June 27, 2023 11:36 AM

    At least that does not seem to be the only way to reproduce the issue. ext4 with journalling is now active but the problem still occurs after some time. 



    ------------------------------
    Dennis Vinueza
    ------------------------------



  • 13.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    IBM Champion
    Posted Tue June 27, 2023 11:41 AM

    This then had to be brought to the OS/FS vendor's attention as, I'd say, this is fully in their realm ... or they could at least tell what we should be doing differently.



    ------------------------------
    Andreas Legner
    ------------------------------



  • 14.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    IBM Champion
    Posted Tue June 27, 2023 11:50 AM
    when it is hung track down the oninit process and look at the stack - might help

    https://community.ibm.com/community/user/datamanagement/discussion/nasty-redhat-issue#bmd84e1b8a-caba-4140-ae34-722118dc650a

    Cheers
    Paul

    On 6/27/2023 10:35 AM, Dennis Vinueza via IBM TechXchange Community wrote:
    01000188fd7efd43-ebaef315-11ae-4b4b-b23e-b4e800ae862b-000000@email.amazonses.com">
    At least that does not seem to be the only way to reproduce the issue. ext4 with journalling is now active but the problem still occurs after some... -posted to the "Informix" group
    IBM TechXchange Community

    Informix

    Post New Message
    Re: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
    Reply to Group Reply to Sender
    Dennis Vinueza
    Jun 27, 2023 11:36 AM
    Dennis Vinueza

    At least that does not seem to be the only way to reproduce the issue. ext4 with journalling is now active but the problem still occurs after some time. 



    ------------------------------
    Dennis Vinueza
    ------------------------------
      Reply to Group Online   View Thread   Recommend   Forward   Flag as Inappropriate  




     
    You are subscribed to "Informix" as paul@oninit.com. To change your subscriptions, go to My Subscriptions. To unsubscribe from this community discussion, go to Unsubscribe.



    Original Message:
    Sent: 6/27/2023 11:36:00 AM
    From: Dennis Vinueza
    Subject: RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    At least that does not seem to be the only way to reproduce the issue. ext4 with journalling is now active but the problem still occurs after some time. 



    ------------------------------
    Dennis Vinueza
    ------------------------------

    Original Message:
    Sent: Tue June 27, 2023 11:30 AM
    From: Andreas Legner
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    I was under the impression that ext4 without journalling is the (unsupported) combination that's causing these hangs - is that not the case?



    ------------------------------
    Andreas Legner

    Original Message:
    Sent: Tue June 27, 2023 06:59 AM
    From: Dennis Vinueza
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Hello everyone,

    did anyone ever find out whether this problem is caused bei IDS or the OS? Having journaling enabled or not on OS level did not make any difference, for example.

    @Hrvoje Zokovic had you had any problems of this kind ever since switching off DIRECT_IO ?

    KR
    Dennis



    ------------------------------
    Dennis Vinueza

    Original Message:
    Sent: Sun November 06, 2022 05:09 AM
    From: Hrvoje Zokovic
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Thanks David,
    customer did not notice any performance degradation after turning DIRECT_IO off - so we'll leave it turned off.
    I'm leaving this discussion to Informix and Red Hat owners (or owner) to deal or not to deal wit it.
    But I'll have this remembered for future use :)

    Ty
    Hrvoje



    ------------------------------
    Hrvoje Zokovic

    Original Message:
    Sent: Sat November 05, 2022 07:46 PM
    From: David Williams
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.


    Hi,

    io_submit can block for many reasons - https://stackoverflow.com/questions/34572559/asynchronous-io-io-submit-latency-in-ubuntu-linux

    In this case it appears to be an unaligned write e.g. writing to the 2nd 2K Informix page on a 4K ext4 block.

    https://elixir.bootlin.com/linux/v4.18/source/fs/ext4/file.c#L247

    :* Unaligned direct AIO must be serialized among each other as zeroing
    * of partial blocks of two competing unaligned AIOs can result in data
    * corruption."

    https://elixir.bootlin.com/linux/v4.18/source/fs/ext4/file.c#L106  ext4_unwritten_wait

    This the calls schedule which then invokes the block unplugging mechanism
    https://lwn.net/Articles/438256/

    i.e. have the process  action all pending i/o's!

    Looks like the device took 120s to action all the pending i/os.

    From the comments it is unclear if it needs to submit or action the i/os!

    You could have been hitting the limits in https://stackoverflow.com/questions/34572559/asynchronous-io-io-submit-latency-in-ubuntu-linux item 4

    /sys/block/[disk]/queue/nr_request
    /sys/block/[disk]/device/queue_depth

    or something else.

    It is something for however supports your OS to look into.

    You can also try the stuff in
    https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/

    and use
    - sudo psn -G syscall,wchan,kstack | sed 's/->/\n/g'
    - 0x.tools / run_xcpu.sh runs perf record  / sudo perf report -i

    to get thread stack and see exactly where iin the kernel the time is being spent!

    David.

    ------------------------------
    David Williams

    Original Message:
    Sent: Tue March 08, 2022 04:14 AM
    From: Hrvoje Zokovic
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Anyone has experience/solution with this:
    Feb 14 14:28:46 dbpri kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
    Feb 14 14:28:46 dbpri kernel:   Not tainted 4.18.0-348.12.2.el8_5.x86_64 #1
    Feb 14 14:28:46 dbpri kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Feb 14 14:28:46 dbpri kernel: task:oninit     state:D stack:  0 pid: 1720 ppid: 1714 flags:0x00000084
    Feb 14 14:28:46 dbpri kernel: Call Trace:
    Feb 14 14:28:46 dbpri kernel: __schedule+0x2bd/0x760
    Feb 14 14:28:46 dbpri kernel: ? blk_flush_plug_list+0xd7/0x100
    Feb 14 14:28:46 dbpri kernel: schedule+0x37/0xa0
    Feb 14 14:28:46 dbpri kernel: ext4_unwritten_wait+0x83/0xb0 [ext4]
    Feb 14 14:28:46 dbpri kernel: ? finish_wait+0x80/0x80
    Feb 14 14:28:46 dbpri kernel: ext4_file_write_iter+0x392/0x3e0 [ext4]
    Feb 14 14:28:46 dbpri kernel: aio_write+0xf6/0x1c0
    Feb 14 14:28:46 dbpri kernel: ? blk_mq_sched_insert_requests+0x71/0xf0
    Feb 14 14:28:46 dbpri kernel: ? io_submit_one+0x7e/0x3c0
    Feb 14 14:28:46 dbpri kernel: ? kmem_cache_alloc+0x12e/0x270
    Feb 14 14:28:46 dbpri kernel: io_submit_one+0x131/0x3c0
    Feb 14 14:28:46 dbpri kernel: __x64_sys_io_submit+0xa2/0x180
    Feb 14 14:28:46 dbpri kernel: ? syscall_trace_enter+0x1d3/0x2c0
    Feb 14 14:28:46 dbpri kernel: do_syscall_64+0x5b/0x1a0
    Feb 14 14:28:46 dbpri kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
    Feb 14 14:28:46 dbpri kernel: RIP: 0033:0x7fbb9d46652d
    Feb 14 14:28:46 dbpri kernel: Code: Unable to access opcode bytes at RIP 0x7fbb9d466503.
    Feb 14 14:28:46 dbpri kernel: RSP: 002b:00000000500d7da8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
    Feb 14 14:28:46 dbpri kernel: RAX: ffffffffffffffda RBX: 00007fbb9ecf7280 RCX: 00007fbb9d46652d
    Feb 14 14:28:46 dbpri kernel: RDX: 00000000500d7de8 RSI: 0000000000000001 RDI: 00007fbb98207000
    Feb 14 14:28:46 dbpri kernel: RBP: 00007fbb98207000 R08: 0000000002229800 R09: 00000000538288d0
    Feb 14 14:28:46 dbpri kernel: R10: ffffffffffffffe0 R11: 0000000000000246 R12: 0000000000000001
    Feb 14 14:28:46 dbpri kernel: R13: 0000000000000000 R14: 00000000500d7de8 R15: 0000000053828850

    It is Red Hat Enterprise Linux 8.5 (Ootpa)
    Linux dbpri 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Mon Jan 17 07:06:06 EST 2022 x86_64 x86_64 x86_64 GNU/Linux

    I opened support case but no answer yet
    Ofc I do not want to disable this message :)

    Hrvoje

    ------------------------------
    Hrvoje Zokovic
    ------------------------------
    #Informix


  • 15.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Posted Tue June 27, 2023 12:05 PM

    Hi, yes, we tried that already. pstack does not return anything, but onstat -g stk returns something very similar as from the other post. But, as said, journalling is on. Any other ext4 settings I might be missing?

    Stack for thread: 13 flush_sub(0)
     base: 0x00000000cbfa7000
      len:   69632
       pc: 0x00000000014f95db
      tos: 0x00000000cbfb7c10
    state: sleeping
       vp: 1
     
    0x00000000014f95db (oninit) yield_processor_mvp
    0x00000000014b88d9 (oninit) mt_aio_wait.part.9
    0x00000000014af63e (oninit) mt_aio_start
    0x0000000000eab72a (oninit) aclean_chunk
    0x0000000000eaab17 (oninit) flush_sub
    0x000000000150e5cf (oninit) startup
     
    dumpe2fs /dev/sdb1 | grep journal

    Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum


    ------------------------------
    Dennis Vinueza
    ------------------------------



  • 16.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    IBM Champion
    Posted Tue June 27, 2023 12:13 PM

    oninit stacks, I'd say, in this case actually only a (sleeping) Informix thread's stack aren't that helpful.  All you can glean from them is Informix is waiting on some IO, more specifically on an IO request submitted to the OS kernel.

    The actual problem occurs in the OS kernel, with an OS IO worker process/thread processing our request and being hung interacting with the file system.  These kernel stacks, afair, are contained in the dmesg output.



    ------------------------------
    Andreas Legner
    ------------------------------



  • 17.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    IBM Champion
    Posted Tue June 27, 2023 12:15 PM
    Yep, dmesg was what helped me track it down

    On 6/27/2023 11:12 AM, Andreas Legner via IBM TechXchange Community wrote:
    01000188fda0de26-cad81b4e-ed1a-45b7-ae67-0a0ba2666153-000000@email.amazonses.com">
    oninit stacks, I'd say, in this case actually only a (sleeping) Informix thread's stack aren't that helpful.  All you can glean from them is...
    IBM TechXchange Community

    Informix

    Post New Message
    Re: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
    Reply to Group Reply to Sender
    Andreas Legner
    Jun 27, 2023 12:13 PM
    Andreas Legner

    oninit stacks, I'd say, in this case actually only a (sleeping) Informix thread's stack aren't that helpful.  All you can glean from them is Informix is waiting on some IO, more specifically on an IO request submitted to the OS kernel.

    The actual problem occurs in the OS kernel, with an OS IO worker process/thread processing our request and being hung interacting with the file system.  These kernel stacks, afair, are contained in the dmesg output.



    ------------------------------
    Andreas Legner
    ------------------------------
      Reply to Group Online   View Thread   Recommend   Forward   Flag as Inappropriate  




     
    You are subscribed to "Informix" as paul@oninit.com. To change your subscriptions, go to My Subscriptions. To unsubscribe from this community discussion, go to Unsubscribe.



    Original Message:
    Sent: 6/27/2023 12:13:00 PM
    From: Andreas Legner
    Subject: RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    oninit stacks, I'd say, in this case actually only a (sleeping) Informix thread's stack aren't that helpful.  All you can glean from them is Informix is waiting on some IO, more specifically on an IO request submitted to the OS kernel.

    The actual problem occurs in the OS kernel, with an OS IO worker process/thread processing our request and being hung interacting with the file system.  These kernel stacks, afair, are contained in the dmesg output.



    ------------------------------
    Andreas Legner
    ------------------------------

    Original Message:
    Sent: Tue June 27, 2023 12:04 PM
    From: Dennis Vinueza
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Hi, yes, we tried that already. pstack does not return anything, but onstat -g stk returns something very similar as from the other post. But, as said, journalling is on. Any other ext4 settings I might be missing?

    Stack for thread: 13 flush_sub(0)
     base: 0x00000000cbfa7000
      len:   69632
       pc: 0x00000000014f95db
      tos: 0x00000000cbfb7c10
    state: sleeping
       vp: 1
     
    0x00000000014f95db (oninit) yield_processor_mvp
    0x00000000014b88d9 (oninit) mt_aio_wait.part.9
    0x00000000014af63e (oninit) mt_aio_start
    0x0000000000eab72a (oninit) aclean_chunk
    0x0000000000eaab17 (oninit) flush_sub
    0x000000000150e5cf (oninit) startup
     
    dumpe2fs /dev/sdb1 | grep journal

    Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum


    ------------------------------
    Dennis Vinueza

    Original Message:
    Sent: Tue June 27, 2023 11:49 AM
    From: Paul Watson
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    when it is hung track down the oninit process and look at the stack - might help

    https://community.ibm.com/community/user/datamanagement/discussion/nasty-redhat-issue#bmd84e1b8a-caba-4140-ae34-722118dc650a

    Cheers
    Paul

    On 6/27/2023 10:35 AM, Dennis Vinueza via IBM TechXchange Community wrote:
    01000188fd7efd43-ebaef315-11ae-4b4b-b23e-b4e800ae862b-000000@email.amazonses.com">
    At least that does not seem to be the only way to reproduce the issue. ext4 with journalling is now active but the problem still occurs after some... -posted to the "Informix" group
    IBM TechXchange Community

    Informix

    Post New Message
    Re: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
    Reply to Group Reply to Sender
    Dennis Vinueza
    Jun 27, 2023 11:36 AM
    Dennis Vinueza

    At least that does not seem to be the only way to reproduce the issue. ext4 with journalling is now active but the problem still occurs after some time. 

    ------------------------------ Dennis Vinueza ------------------------------
      Reply to Group Online   View Thread   Recommend   Forward   Flag as Inappropriate  
    Original Message: Sent: Tue June 27, 2023 11:30 AM
     
    You are subscribed to "Informix" as paul@oninit.com. To change your subscriptions, go to My Subscriptions. To unsubscribe from this community discussion, go to Unsubscribe.



    Original Message:
    Sent: 6/27/2023 11:36:00 AM
    From: Dennis Vinueza
    Subject: RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    At least that does not seem to be the only way to reproduce the issue. ext4 with journalling is now active but the problem still occurs after some time. 



    ------------------------------
    Dennis Vinueza

    Original Message:
    Sent: Tue June 27, 2023 11:30 AM
    From: Andreas Legner
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    I was under the impression that ext4 without journalling is the (unsupported) combination that's causing these hangs - is that not the case?



    ------------------------------
    Andreas Legner

    Original Message:
    Sent: Tue June 27, 2023 06:59 AM
    From: Dennis Vinueza
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Hello everyone,

    did anyone ever find out whether this problem is caused bei IDS or the OS? Having journaling enabled or not on OS level did not make any difference, for example.

    @Hrvoje Zokovic had you had any problems of this kind ever since switching off DIRECT_IO ?

    KR
    Dennis



    ------------------------------
    Dennis Vinueza

    Original Message:
    Sent: Sun November 06, 2022 05:09 AM
    From: Hrvoje Zokovic
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Thanks David,
    customer did not notice any performance degradation after turning DIRECT_IO off - so we'll leave it turned off.
    I'm leaving this discussion to Informix and Red Hat owners (or owner) to deal or not to deal wit it.
    But I'll have this remembered for future use :)

    Ty
    Hrvoje



    ------------------------------
    Hrvoje Zokovic

    Original Message:
    Sent: Sat November 05, 2022 07:46 PM
    From: David Williams
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.


    Hi,

    io_submit can block for many reasons - https://stackoverflow.com/questions/34572559/asynchronous-io-io-submit-latency-in-ubuntu-linux

    In this case it appears to be an unaligned write e.g. writing to the 2nd 2K Informix page on a 4K ext4 block.

    https://elixir.bootlin.com/linux/v4.18/source/fs/ext4/file.c#L247

    :* Unaligned direct AIO must be serialized among each other as zeroing
    * of partial blocks of two competing unaligned AIOs can result in data
    * corruption."

    https://elixir.bootlin.com/linux/v4.18/source/fs/ext4/file.c#L106  ext4_unwritten_wait

    This the calls schedule which then invokes the block unplugging mechanism
    https://lwn.net/Articles/438256/

    i.e. have the process  action all pending i/o's!

    Looks like the device took 120s to action all the pending i/os.

    From the comments it is unclear if it needs to submit or action the i/os!

    You could have been hitting the limits in https://stackoverflow.com/questions/34572559/asynchronous-io-io-submit-latency-in-ubuntu-linux item 4

    /sys/block/[disk]/queue/nr_request
    /sys/block/[disk]/device/queue_depth

    or something else.

    It is something for however supports your OS to look into.

    You can also try the stuff in
    https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/

    and use
    - sudo psn -G syscall,wchan,kstack | sed 's/->/\n/g'
    - 0x.tools / run_xcpu.sh runs perf record  / sudo perf report -i

    to get thread stack and see exactly where iin the kernel the time is being spent!

    David.

    ------------------------------
    David Williams

    Original Message:
    Sent: Tue March 08, 2022 04:14 AM
    From: Hrvoje Zokovic
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Anyone has experience/solution with this:
    Feb 14 14:28:46 dbpri kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
    Feb 14 14:28:46 dbpri kernel:   Not tainted 4.18.0-348.12.2.el8_5.x86_64 #1
    Feb 14 14:28:46 dbpri kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Feb 14 14:28:46 dbpri kernel: task:oninit     state:D stack:  0 pid: 1720 ppid: 1714 flags:0x00000084
    Feb 14 14:28:46 dbpri kernel: Call Trace:
    Feb 14 14:28:46 dbpri kernel: __schedule+0x2bd/0x760
    Feb 14 14:28:46 dbpri kernel: ? blk_flush_plug_list+0xd7/0x100
    Feb 14 14:28:46 dbpri kernel: schedule+0x37/0xa0
    Feb 14 14:28:46 dbpri kernel: ext4_unwritten_wait+0x83/0xb0 [ext4]
    Feb 14 14:28:46 dbpri kernel: ? finish_wait+0x80/0x80
    Feb 14 14:28:46 dbpri kernel: ext4_file_write_iter+0x392/0x3e0 [ext4]
    Feb 14 14:28:46 dbpri kernel: aio_write+0xf6/0x1c0
    Feb 14 14:28:46 dbpri kernel: ? blk_mq_sched_insert_requests+0x71/0xf0
    Feb 14 14:28:46 dbpri kernel: ? io_submit_one+0x7e/0x3c0
    Feb 14 14:28:46 dbpri kernel: ? kmem_cache_alloc+0x12e/0x270
    Feb 14 14:28:46 dbpri kernel: io_submit_one+0x131/0x3c0
    Feb 14 14:28:46 dbpri kernel: __x64_sys_io_submit+0xa2/0x180
    Feb 14 14:28:46 dbpri kernel: ? syscall_trace_enter+0x1d3/0x2c0
    Feb 14 14:28:46 dbpri kernel: do_syscall_64+0x5b/0x1a0
    Feb 14 14:28:46 dbpri kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
    Feb 14 14:28:46 dbpri kernel: RIP: 0033:0x7fbb9d46652d
    Feb 14 14:28:46 dbpri kernel: Code: Unable to access opcode bytes at RIP 0x7fbb9d466503.
    Feb 14 14:28:46 dbpri kernel: RSP: 002b:00000000500d7da8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
    Feb 14 14:28:46 dbpri kernel: RAX: ffffffffffffffda RBX: 00007fbb9ecf7280 RCX: 00007fbb9d46652d
    Feb 14 14:28:46 dbpri kernel: RDX: 00000000500d7de8 RSI: 0000000000000001 RDI: 00007fbb98207000
    Feb 14 14:28:46 dbpri kernel: RBP: 00007fbb98207000 R08: 0000000002229800 R09: 00000000538288d0
    Feb 14 14:28:46 dbpri kernel: R10: ffffffffffffffe0 R11: 0000000000000246 R12: 0000000000000001
    Feb 14 14:28:46 dbpri kernel: R13: 0000000000000000 R14: 00000000500d7de8 R15: 0000000053828850

    It is Red Hat Enterprise Linux 8.5 (Ootpa)
    Linux dbpri 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Mon Jan 17 07:06:06 EST 2022 x86_64 x86_64 x86_64 GNU/Linux

    I opened support case but no answer yet
    Ofc I do not want to disable this message :)

    Hrvoje

    ------------------------------
    Hrvoje Zokovic
    ------------------------------
    #Informix


  • 18.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Posted Wed June 28, 2023 02:57 AM
    Edited by Dennis Vinueza Wed June 28, 2023 03:03 AM

    Hi, 

    we already opened a ticket to help track it down.

    So, the dmesg shows;

     [4552624.303025] task:kworker/2:0     state:D stack:    0 pid:1812473 ppid:                                                                                                  2 flags:0x80004080
    [4552624.303035] Workqueue: dio/sdb1 dio_aio_complete_work
    [4552624.303046] Call Trace:
    [4552624.303052]  __schedule+0x2d1/0x860
    [4552624.303062]  schedule+0x35/0xa0
    [4552624.303069]  rwsem_down_write_slowpath+0x30c/0x5c0
    [4552624.303082]  __generic_file_fsync+0x42/0xc0
    [4552624.303093]  ext4_sync_file+0x1d0/0x3b0 [ext4]
    [4552624.303237]  dio_complete+0x1e3/0x240
    [4552624.303249]  process_one_work+0x1a7/0x360
    [4552624.303258]  worker_thread+0x30/0x390
    [4552624.303265]  ? create_worker+0x1a0/0x1a0
    [4552624.303272]  kthread+0x10b/0x130
    [4552624.303281]  ? set_kthread_struct+0x50/0x50
    [4552624.303291]  ret_from_fork+0x1f/0x40
    [4552624.303308] INFO: task kworker/2:2:1817377 blocked for more than 120 second                                                                                             s.
    [4552624.303450]       Not tainted 4.18.0-425.3.1.el8.x86_64 #1
    [4552624.303560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables thi                                                                                             s message.
    [4552624.303713] task:kworker/2:2     state:D stack:    0 pid:1817377 ppid:                                                                                                  2 flags:0x80004080
    [4552624.303723] Workqueue: dio/sdb1 dio_aio_complete_work
    [4552624.303733] Call Trace:
    [4552624.303738]  __schedule+0x2d1/0x860
    [4552624.303746]  ? __sched_setscheduler+0x4d6/0x8a0
    [4552624.303758]  schedule+0x35/0xa0
    [4552624.303765]  rwsem_down_write_slowpath+0x30c/0x5c0
    [4552624.303778]  __generic_file_fsync+0x42/0xc0
    [4552624.303789]  ext4_sync_file+0x1d0/0x3b0 [ext4]
    [4552624.303909]  dio_complete+0x1e3/0x240
    [4552624.303920]  process_one_work+0x1a7/0x360
    [4552624.303929]  worker_thread+0x30/0x390
    [4552624.303937]  ? create_worker+0x1a0/0x1a0
    [4552624.303944]  kthread+0x10b/0x130
    [4552624.303953]  ? set_kthread_struct+0x50/0x50
    [4552624.303963]  ret_from_fork+0x1f/0x40

    KR

    Dennis



    ------------------------------
    Dennis Vinueza
    ------------------------------



  • 19.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    IBM Champion
    Posted Tue June 27, 2023 11:44 AM
    ext4 without journalling will hang the engine permanently and engine cannot be killed even with a kill-9 on the PID.  dmesg will show the issue

    While RH recognize it as an issue they d not support ext4 without a journal

    On 6/27/2023 10:30 AM, Andreas Legner via IBM TechXchange Community wrote:
    01000188fd7a0808-2976f5ec-6f82-4c20-a281-541e56d97d5b-000000@email.amazonses.com">
    I was under the impression that ext4 without journalling is the (unsupported) combination that's causing these hangs - is that not the case? ---...
    IBM TechXchange Community

    Informix

    Post New Message
    Re: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
    Reply to Group Reply to Sender
    Andreas Legner
    Jun 27, 2023 11:30 AM
    Andreas Legner

    I was under the impression that ext4 without journalling is the (unsupported) combination that's causing these hangs - is that not the case?



    ------------------------------
    Andreas Legner
    ------------------------------
      Reply to Group Online   View Thread   Recommend   Forward   Flag as Inappropriate  




     
    You are subscribed to "Informix" as paul@oninit.com. To change your subscriptions, go to My Subscriptions. To unsubscribe from this community discussion, go to Unsubscribe.



    Original Message:
    Sent: 6/27/2023 11:30:00 AM
    From: Andreas Legner
    Subject: RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    I was under the impression that ext4 without journalling is the (unsupported) combination that's causing these hangs - is that not the case?



    ------------------------------
    Andreas Legner
    ------------------------------

    Original Message:
    Sent: Tue June 27, 2023 06:59 AM
    From: Dennis Vinueza
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Hello everyone,

    did anyone ever find out whether this problem is caused bei IDS or the OS? Having journaling enabled or not on OS level did not make any difference, for example.

    @Hrvoje Zokovic had you had any problems of this kind ever since switching off DIRECT_IO ?

    KR
    Dennis



    ------------------------------
    Dennis Vinueza

    Original Message:
    Sent: Sun November 06, 2022 05:09 AM
    From: Hrvoje Zokovic
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Thanks David,
    customer did not notice any performance degradation after turning DIRECT_IO off - so we'll leave it turned off.
    I'm leaving this discussion to Informix and Red Hat owners (or owner) to deal or not to deal wit it.
    But I'll have this remembered for future use :)

    Ty
    Hrvoje



    ------------------------------
    Hrvoje Zokovic

    Original Message:
    Sent: Sat November 05, 2022 07:46 PM
    From: David Williams
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.


    Hi,

    io_submit can block for many reasons - https://stackoverflow.com/questions/34572559/asynchronous-io-io-submit-latency-in-ubuntu-linux

    In this case it appears to be an unaligned write e.g. writing to the 2nd 2K Informix page on a 4K ext4 block.

    https://elixir.bootlin.com/linux/v4.18/source/fs/ext4/file.c#L247

    :* Unaligned direct AIO must be serialized among each other as zeroing
    * of partial blocks of two competing unaligned AIOs can result in data
    * corruption."

    https://elixir.bootlin.com/linux/v4.18/source/fs/ext4/file.c#L106  ext4_unwritten_wait

    This the calls schedule which then invokes the block unplugging mechanism
    https://lwn.net/Articles/438256/

    i.e. have the process  action all pending i/o's!

    Looks like the device took 120s to action all the pending i/os.

    From the comments it is unclear if it needs to submit or action the i/os!

    You could have been hitting the limits in https://stackoverflow.com/questions/34572559/asynchronous-io-io-submit-latency-in-ubuntu-linux item 4

    /sys/block/[disk]/queue/nr_request
    /sys/block/[disk]/device/queue_depth

    or something else.

    It is something for however supports your OS to look into.

    You can also try the stuff in
    https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/

    and use
    - sudo psn -G syscall,wchan,kstack | sed 's/->/\n/g'
    - 0x.tools / run_xcpu.sh runs perf record  / sudo perf report -i

    to get thread stack and see exactly where iin the kernel the time is being spent!

    David.

    ------------------------------
    David Williams

    Original Message:
    Sent: Tue March 08, 2022 04:14 AM
    From: Hrvoje Zokovic
    Subject: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    Anyone has experience/solution with this:
    Feb 14 14:28:46 dbpri kernel: INFO: task oninit:1720 blocked for more than 120 seconds.
    Feb 14 14:28:46 dbpri kernel:   Not tainted 4.18.0-348.12.2.el8_5.x86_64 #1
    Feb 14 14:28:46 dbpri kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Feb 14 14:28:46 dbpri kernel: task:oninit     state:D stack:  0 pid: 1720 ppid: 1714 flags:0x00000084
    Feb 14 14:28:46 dbpri kernel: Call Trace:
    Feb 14 14:28:46 dbpri kernel: __schedule+0x2bd/0x760
    Feb 14 14:28:46 dbpri kernel: ? blk_flush_plug_list+0xd7/0x100
    Feb 14 14:28:46 dbpri kernel: schedule+0x37/0xa0
    Feb 14 14:28:46 dbpri kernel: ext4_unwritten_wait+0x83/0xb0 [ext4]
    Feb 14 14:28:46 dbpri kernel: ? finish_wait+0x80/0x80
    Feb 14 14:28:46 dbpri kernel: ext4_file_write_iter+0x392/0x3e0 [ext4]
    Feb 14 14:28:46 dbpri kernel: aio_write+0xf6/0x1c0
    Feb 14 14:28:46 dbpri kernel: ? blk_mq_sched_insert_requests+0x71/0xf0
    Feb 14 14:28:46 dbpri kernel: ? io_submit_one+0x7e/0x3c0
    Feb 14 14:28:46 dbpri kernel: ? kmem_cache_alloc+0x12e/0x270
    Feb 14 14:28:46 dbpri kernel: io_submit_one+0x131/0x3c0
    Feb 14 14:28:46 dbpri kernel: __x64_sys_io_submit+0xa2/0x180
    Feb 14 14:28:46 dbpri kernel: ? syscall_trace_enter+0x1d3/0x2c0
    Feb 14 14:28:46 dbpri kernel: do_syscall_64+0x5b/0x1a0
    Feb 14 14:28:46 dbpri kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
    Feb 14 14:28:46 dbpri kernel: RIP: 0033:0x7fbb9d46652d
    Feb 14 14:28:46 dbpri kernel: Code: Unable to access opcode bytes at RIP 0x7fbb9d466503.
    Feb 14 14:28:46 dbpri kernel: RSP: 002b:00000000500d7da8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
    Feb 14 14:28:46 dbpri kernel: RAX: ffffffffffffffda RBX: 00007fbb9ecf7280 RCX: 00007fbb9d46652d
    Feb 14 14:28:46 dbpri kernel: RDX: 00000000500d7de8 RSI: 0000000000000001 RDI: 00007fbb98207000
    Feb 14 14:28:46 dbpri kernel: RBP: 00007fbb98207000 R08: 0000000002229800 R09: 00000000538288d0
    Feb 14 14:28:46 dbpri kernel: R10: ffffffffffffffe0 R11: 0000000000000246 R12: 0000000000000001
    Feb 14 14:28:46 dbpri kernel: R13: 0000000000000000 R14: 00000000500d7de8 R15: 0000000053828850

    It is Red Hat Enterprise Linux 8.5 (Ootpa)
    Linux dbpri 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Mon Jan 17 07:06:06 EST 2022 x86_64 x86_64 x86_64 GNU/Linux

    I opened support case but no answer yet
    Ofc I do not want to disable this message :)

    Hrvoje

    ------------------------------
    Hrvoje Zokovic
    ------------------------------
    #Informix


  • 20.  RE: kernel: INFO: task oninit:1720 blocked for more than 120 seconds.

    IBM Champion
    Posted Thu July 13, 2023 01:32 PM

    Hi @Dennis Vinueza 

    sorry for 2 weeks delay. Customer did not have any issue since DIRECT_IO was turned off. They even did not notice any performance issues (because not using kaio).

    HTH

    Hrvoje



    ------------------------------
    Hrvoje Zokovic
    ------------------------------