simscitizen 6 hours ago

Oh I've debugged this before. Native memory allocator had a scavenge function which suspended all other threads. Managed language runtime had a stop the world phase which suspended all mutator threads. They ran at about the same time and ended up suspending each other. To fix this you need to enforce some sort of hierarchy or mutual exclusion for suspension requests.
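
A minimal sketch of the mutual-exclusion option in Python, assuming every suspender in the process agrees to go through one shared gate (which is the hard part when the suspenders live in unrelated libraries). Python cannot actually suspend threads, so `suspend_all` and `resume_all` are hypothetical callbacks standing in for the real suspension calls, and `suspension_window` is a made-up name:

```python
import threading
from contextlib import contextmanager

# Hypothetical process-wide gate: anyone who wants to suspend other threads
# must hold it for the whole suspend -> work -> resume window.
_suspension_gate = threading.Lock()


@contextmanager
def suspension_window(suspend_all, resume_all):
    """Run a suspend/resume pair with at most one suspender active."""
    with _suspension_gate:
        suspend_all()        # stand-in for the real "suspend everyone" call
        try:
            yield
        finally:
            resume_all()     # always resume before letting the next one in


def demo(rounds=200):
    """Two simulated subsystems compete for the gate; returns the event log."""
    log = []

    def subsystem(name):
        for _ in range(rounds):
            with suspension_window(
                lambda: log.append((name, "suspend")),
                lambda: log.append((name, "resume")),
            ):
                log.append((name, "work"))

    threads = [
        threading.Thread(target=subsystem, args=(name,))
        for name in ("allocator", "gc")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In the log returned by `demo()`, each suspend/work/resume triple belongs to a single subsystem, so neither one is ever caught mid-suspension by the other.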

> Why you should never suspend a thread in your own process.

This sounds like a good general principle, but suspending threads in your own process is kind of necessary for e.g. many GC algorithms. Now imagine multiple of those runtimes running in the same process.

  • hyperpape 3 hours ago

    > suspending threads in your own process is kind of necessary for e.g. many GC algorithms

    I think this is typically done by having the compiler/runtime insert safepoints, which cooperatively yield at specified points to allow the GC to run without mutator threads being active. Done correctly, this shouldn't be subject to the problem the original post highlighted, because it doesn't rely on the OS's ability to suspend threads when they aren't expecting it.
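
    A rough Python sketch of that cooperative scheme, not how any particular runtime implements it. `SafepointCoordinator` and its methods are hypothetical names, and a real runtime would use a much cheaper poll than a condition variable:

```python
import threading
import time


class SafepointCoordinator:
    """Cooperative stop-the-world: mutators park themselves at safepoints."""

    def __init__(self, mutator_count):
        self._cond = threading.Condition()
        self._mutators = mutator_count
        self._parked = 0
        self._stop_requested = False

    def safepoint(self):
        """Called by a mutator at a point where it is safe to be paused."""
        with self._cond:
            if not self._stop_requested:
                return
            self._parked += 1
            self._cond.notify_all()        # tell the collector we are parked
            while self._stop_requested:
                self._cond.wait()
            self._parked -= 1

    def mutator_exit(self):
        """Called by a mutator that is finishing, so nobody waits for it."""
        with self._cond:
            self._mutators -= 1
            self._cond.notify_all()

    def stop_the_world(self, work):
        """Wait until every live mutator is parked, run work(), then resume."""
        with self._cond:
            self._stop_requested = True
            while self._parked < self._mutators:
                self._cond.wait()
            try:
                return work()
            finally:
                self._stop_requested = False
                self._cond.notify_all()


def demo(mutator_count=3):
    """Returns True if no mutator made progress during the stopped phase."""
    coord = SafepointCoordinator(mutator_count)
    counter = {"value": 0}
    done = threading.Event()

    def mutator():
        while not done.is_set():
            counter["value"] += 1          # the "mutation" between safepoints
            coord.safepoint()
        coord.mutator_exit()

    def world_is_quiet():
        before = counter["value"]
        time.sleep(0.05)
        return counter["value"] == before

    threads = [threading.Thread(target=mutator) for _ in range(mutator_count)]
    for t in threads:
        t.start()
    quiet = coord.stop_the_world(world_is_quiet)
    done.set()
    for t in threads:
        t.join()
    return quiet
```

    Each mutator only ever pauses inside `safepoint()`, where it holds none of its own locks, which is why this avoids the hazard of being frozen at an arbitrary instruction.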

boxed 2 hours ago

I had a support issue once at a big, well-known US defense firm. We consistently got hangs in kernel space from normal user-level code. Crazy shit. I opened a support issue which eventually got closed because we used an old compiler. Fun times.

zavec 4 hours ago

I knew from seeing a title like that on microsoft.com that it was going to be a Raymond Chen post! He writes fascinating stuff.

  • eyelidlessness 2 hours ago

    I thought the same thing. It’s usually content that’s well outside my areas of familiarity, often even outside my areas of interest. But I usually find his writing interesting enough to read through anyway, and clear enough that I can usually follow it even without familiarity with the subject matter.

ot 5 hours ago

On Linux you'd do this by sending a signal to the thread you want to analyze, and then the signal handler would take the stack trace and send it back to the watchdog.

The tricky part is ensuring that the signal handler code is async-signal-safe (which pretty much boils down to "ensure you're not acquiring any locks and be careful about reentrant code"), but at least that only has to be verified for a self-contained small function.
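
For illustration only, here is the same shape in Python on a Unix-like system (it assumes `SIGUSR1` and `signal.pthread_kill` are available). It is a loose analogue rather than the C-level technique: CPython runs Python-level handlers in the main thread between bytecodes, so the handler here is not under real async-signal-safety constraints, and it may not run while the main thread is stuck inside a blocking C call. The names `request_trace`, `busy_work`, and `demo` are made up for the sketch:

```python
import queue
import signal
import threading
import time
import traceback

# SimpleQueue.put is documented as reentrant, so the handler may call it even
# if it interrupts another put on the same queue.
_traces = queue.SimpleQueue()


def _report_stack(signum, frame):
    # `frame` is the Python frame that was executing when the signal arrived.
    _traces.put(traceback.format_stack(frame))


def request_trace(thread_ident, timeout=2.0):
    """Watchdog side: signal the target thread and wait for its stack."""
    signal.pthread_kill(thread_ident, signal.SIGUSR1)
    return _traces.get(timeout=timeout)    # raises queue.Empty on timeout


def busy_work(started, stop):
    """Stand-in for the monitored thread's normal work."""
    started.set()
    while not stop.is_set():
        time.sleep(0.01)


def demo():
    """Must be called from the main thread; returns the captured stack."""
    signal.signal(signal.SIGUSR1, _report_stack)
    started, stop = threading.Event(), threading.Event()
    result = {}

    def watchdog():
        try:
            started.wait()
            result["trace"] = request_trace(threading.main_thread().ident)
        finally:
            stop.set()                     # let the main thread finish

    t = threading.Thread(target=watchdog)
    t.start()
    busy_work(started, stop)
    t.join()
    return result.get("trace", [])
```

The handler does one thing, a reentrant `put`, which mirrors the "small, self-contained function" constraint even though Python is far more forgiving here than C.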

Is there anything similar to signals on Windows?

frabona an hour ago

Such a clean breakdown. "Don’t suspend your own threads" should be tattooed on every Windows dev’s arm at this point

pitterpatter 4 hours ago

Reminds me of a hang in the Settings UI, caused by it getting stuck on an RPC call to some service.

Why was the service holding things up? Because it was waiting on acquiring a lock held by one of its other threads.

What was that other thread doing? It was deadlocked because it tried to recursively acquire an exclusive srwlock (exactly what the docs say will happen if you try).

Why was it even trying to reacquire said lock? Ultimately because of a buffer overrun that ended up overwriting some important structures.
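
The recursive-acquisition step is easy to reproduce by analogy in Python, where `threading.Lock` is also non-reentrant. This is only an analogy for the SRW lock behavior, and `reacquire_succeeds` is a made-up helper; the timeout is there so the sketch reports the self-deadlock instead of hanging:

```python
import threading


def reacquire_succeeds(lock, timeout=0.1):
    """Take `lock`, then try to take it again on the same thread."""
    with lock:
        got_it = lock.acquire(timeout=timeout)  # would block forever without a timeout
        if got_it:
            lock.release()
        return got_it


print(reacquire_succeeds(threading.Lock()))   # False: the thread blocks on itself
print(reacquire_succeeds(threading.RLock()))  # True: RLock tracks its owner and allows re-entry
```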

markus_zhang 4 hours ago

Although I understand nothing from these posts, reading Raymond's posts somehow always "tranquilizes" my inner struggles.

Just curious, is this customer a game studio? I have never done any serious system programming but the gist feels like one.

  • ajkjk 3 hours ago

    I would guess it's something corporate. They can afford to pause the UI and ship debugging traces home more than a real-time game might.

    • delusional 3 hours ago

      I'd actually expect a customer-facing program more. Corporate software wouldn't care that the UI hung, you're getting paid to sit there and look at it.

      • skissane 31 minutes ago

        > Corporate software wouldn't care that the UI hung, you're getting paid to sit there and look at it.

        The article says the thread had been hung for 5 hours. And if you understand the root cause, once it entered into the hung state, then absent some rather dramatic intervention (e.g. manually resuming the suspended UI thread), it would remain hung indefinitely.

        The proper solution, as Raymond Chen notes, is to move the monitoring thread into a separate process, which would avoid this deadlock.

      • tedunangst 2 hours ago

        The banker trying to close a deal isn't paid by the hour.

      • immibis an hour ago

        Unless the user's boss complained to the programmer's boss

makz 2 hours ago

Looking at the title, at first I thought “uh?”, but then I saw Microsoft and it made sense.

rat87 4 hours ago

Reminds me of a bug that would bluescreen Windows if I stopped Visual Studio debugging while it was in the middle of calling the native Ping from C#.

  • bob1029 3 hours ago

    I've been able to get managed code to BSOD my machine by simply having a lot of thread instances that are aggressively communicating with each other (i.e., via Channel<T>). It's probably more of a hardware thing than a software thing. My Spotify fails to keep the audio buffer filled when I've got it fully saturated. I feel like the kernel occasionally panics when something doesn't resolve fast enough with regard to threads across core complexes.

brcmthrowaway 4 hours ago

Can this happen with Grand Central Dispatch?

  • immibis an hour ago

    did... did you understand what the bug was?