Saturday, December 19, 2020

Disappointing

Well, time goes on at a rapid pace, and people who tell you it goes faster as you get older aren't lying. It's been a particularly hard year, with the issues surrounding covid in the background, but AROS still moves on as steadily as ever.

One thing that has left me disappointed, though, was stumbling across "AROS Exec" - a forum I used to think of as home, back when it had a friendly, respectful community. It's a shame those days are long gone. The place is all but dead, with only a handful of die-hard, vocal, "anti ABI v1" fanboys remaining, spreading their long-tired stories about how bad ABI v1 is - and chasing any prospective new members away with their rude attitudes and bad information.

We once tried hard to keep the Amiga troll-like characters out of the forums, because we knew it would end this way - it's just sad to see it has.

Equally disappointing are the disparaging remarks found there about the hard work being put into AROS by the developers. ABI v1 is both praised without acknowledgment whenever its improvements and fixes are backported to the "ABI v0(*)" code they are using, with the credit going to the people doing the backporting - and also claimed to be an unstable mess with little done to improve it.

Despite the huge amount of debugging, the number of fixes to the underlying system, and the improvements to cater for running on m68k and newer PC hardware (and fixes to allow AROS to boot on considerably more systems than it ever did in the past), they make statements about how ABI v1 hasn't had any work or improvement done - in part because they are already using improvements that were never meant to be part of ABI v0 in the Frankenstein version they do use, and they attribute those improvements not to the AROS developers but to the distro maintainers who backported them.

It really is sad to see people so tied up in a false narrative that they cannot even see how foolish the path they go down is.

I'm glad this year is almost over.

Going back to AROS though, I cannot wait to see what the next year brings. There has been some recent interest in updating the localization for a number of languages - something that has always been an annoyance for me. Unlike code, you can't just jump into another language and correct things, or Google Translate will make a complete farce of it - and despite the number of supposed users and developers AROS has had from all over the world, very few show the interest or dedication needed to make sure their language and localization are properly supported.

We made some changes this year to separate the actual translations out of the code base, because it was apparent some people were interested in contributing but didn't know where to begin in the bewildering source tree that is AROS. To cater for this, all the translations are now handled in a sister "site" - the AROS Translation Team, on GitHub. This means people who only want to help support their language can easily contribute and have the improvements included with AROS over time. There is still a long way to go until using it is comfortable and we have a nice set of guides for people to get involved - but it's a start in the right direction.

LLVM is getting closer and closer to building AROS completely, with only a few smaller niggles still to iron out before the first testable binaries should come off the press. However, hosting nightly builds for it is proving somewhat of a problem. Just building the LLVM toolchain takes more time than most complete AROS builds including the GNU toolchain, and needs massively more space - more, in fact, than the virtual environments hosting the builds provide. Really addressing this needs a bit of work in AROS, but it is lower priority than other work currently underway. Over the last 2 years we have discussed breaking the builds up so that the toolchain is not only built separately, but made available for download once built. We also discussed addressing the lack of a proper SDK for AROS, which would be a necessary companion to the toolchain in order to produce binaries suitable for use. A little work has been started on these things but, as mentioned, the effort needed is lower priority than fixing the current issues.

One annoyance that I have been investigating for a while is the lockups during boot that some people experience. They tend to be the very hard to replicate kind - only happening with certain hardware or driver configurations. So far I've managed to fix quite a large number, some of which have been in the code base since 2000 or before! They include (but aren't limited to):

# Code passing around handles to stale data on the stack, dating back to the early 2000s. Even Intuition was doing it, leading to some bizarre results.
# HIDDs using the same attribute IDs for different classes' attributes, due to the way they were allocated. While it may not sound like much, this was very hard to debug. It only happened if the drivers started and some failed in specific orders, until eventually one driver would try accessing an object using different attributes that had the same value - confusing the life out of the code trying to figure out what it was meant to be doing. In the best case it had benign consequences, but at the other end of the scale it could end up trashing memory or just crashing AROS for apparently random reasons.
# The ATA device just doing things because it wanted to try, regardless of the consequences, even when the hardware was already shouting about things not working. This one is a common reason for AROS not booting: it tries to talk to ghost devices and ends up hanging forever.
# The PS/2 keyboard/mouse driver conflicting with timer.device over access to the PIT timer, causing all sorts of weird erratic behaviour - or locking up for some people. I've now rewritten the PS/2 driver to use timer.device itself, resolving the problem completely.
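
To illustrate the attribute ID problem from the list above, here is a minimal sketch of an allocator that can never hand the same ID range to two different interfaces. It is not the actual AROS code: the registry, the names `attr_base_for` and `attr_id`, and the fixed block size are all hypothetical, chosen only to show the idea that a base, once assigned to an interface name, is never reused for another one.

```c
#include <stddef.h>
#include <string.h>

#define MAX_INTERFACES      16   /* hypothetical limit for this sketch */
#define ATTRS_PER_INTERFACE 64   /* hypothetical block of IDs reserved per interface */

struct attr_registry {
    const char *names[MAX_INTERFACES];  /* interface name owning each block */
    size_t count;                       /* number of blocks handed out so far */
};

/* Return the attribute base for an interface name, or -1 if the registry is
 * full. The same name always gets the same base, and a base is never given
 * to a second name, whatever order the drivers start (or fail) in. */
static long attr_base_for(struct attr_registry *reg, const char *name)
{
    for (size_t i = 0; i < reg->count; i++) {
        if (strcmp(reg->names[i], name) == 0)
            return (long)(i * ATTRS_PER_INTERFACE);
    }
    if (reg->count == MAX_INTERFACES)
        return -1;
    reg->names[reg->count] = name;
    return (long)(reg->count++ * ATTRS_PER_INTERFACE);
}

/* A full attribute ID is the interface's base plus the attribute's offset. */
static long attr_id(long base, unsigned offset)
{
    return base + (long)offset;
}
```

With something like this, offset 3 of one interface and offset 3 of another always produce different IDs, so an object can tell which attribute it is being asked about.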

.. and then there are the ones I am still working away at.

A few years ago we introduced code to change the interrupt handling in AROS. Up until that point, AROS had only supported the legacy PIC's interrupts. This was a bit of an issue, because we wanted to bring up other CPU cores and also look at using other newer PC hardware features (MSI interrupts, etc.). A lot of effort has been put in to rework this so that we now use all of the CPU's vectors, and have separate "drivers" for the different interrupt controllers in a PC. We still have code for the old legacy PIC, which can be used on its own or as a fallback, but we now also have an APIC interrupt controller that registers the remaining device IRQs and allows us to expose functions for routing and modifying them outside the kernel. We also have an IOAPIC driver - but this one has been the cause of a few headaches, to say the least.

According to the docs, you set up your APIC, parse the ACPI tables to figure out what IRQs are in the system and how they should be delivered, find the IOAPIC(s) and configure them. Sounds easy? Well, after coding all that it appeared to work great ... except not everywhere.
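
As a rough illustration of the "configure them" step, the sketch below turns the flags from an ACPI MADT interrupt source override into an IOAPIC redirection table entry. The bit positions follow the ACPI and I/O APIC specifications; the function name is made up for this example, and treating "conforms to bus" as the ISA default (active high, edge triggered) is an assumption of the sketch, not a description of what AROS does.

```c
#include <stdint.h>

/* MADT interrupt source override flags: bits 0-1 polarity, bits 2-3 trigger */
#define MADT_POLARITY_MASK   0x3u
#define MADT_POLARITY_LOW    0x3u   /* 11b = active low */
#define MADT_TRIGGER_MASK    0xCu
#define MADT_TRIGGER_LEVEL   0xCu   /* 11b = level triggered */

/* IOAPIC redirection table entry bits */
#define IOAPIC_POLARITY_LOW  (1ull << 13)
#define IOAPIC_TRIGGER_LEVEL (1ull << 15)
#define IOAPIC_MASKED        (1ull << 16)

/* Build an unmasked entry using fixed delivery mode and physical destination
 * mode (both encoded as zero). Hypothetical helper, for illustration only. */
static uint64_t ioapic_redir_entry(uint8_t vector, uint8_t dest_apic_id,
                                   uint16_t madt_flags)
{
    uint64_t entry = vector;                      /* bits 0-7: CPU vector */

    if ((madt_flags & MADT_POLARITY_MASK) == MADT_POLARITY_LOW)
        entry |= IOAPIC_POLARITY_LOW;
    if ((madt_flags & MADT_TRIGGER_MASK) == MADT_TRIGGER_LEVEL)
        entry |= IOAPIC_TRIGGER_LEVEL;

    entry |= (uint64_t)dest_apic_id << 56;        /* bits 56-63: destination */
    return entry;
}
```

Polarity and trigger mode give four possible combinations for any one IRQ, and the entry can be kept out of action while it is being changed by OR-ing in `IOAPIC_MASKED`.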

It turns out that even though it's easy enough to set up the IOAPIC stuff using the information provided, a lot of systems lie about the PIT timer and how it is wired up. And this is where the problems occur. On some systems, as soon as the timer is used it goes haywire, resulting in an IRQ storm, so AROS doesn't do anything except continually handle the timer interrupt. I've been scratching my head about this for the past little while, because it isn't very well documented (if at all) how to handle it. Sure, you can disable the IOAPIC (and AROS has a "noioapic" boot option for this purpose), but how do you make AROS work correctly with such flaky hardware?

One thing I tried was changing the delivery details for the IRQ. I spent a considerable amount of time writing code to try to detect when the storm happened and, when it did, reprogram the route for the IRQ, then let that run for a while to see if it fixed it. Suffice to say ... it didn't. I tried all 4 combinations of delivery options with the same result ... so what can it be? To be continued ...
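
For readers wondering what "detecting the storm" might look like, here is a minimal sketch of one possible approach: count how often an IRQ fires inside a fixed time window and flag it once the count passes a threshold. This is not the code used in AROS; the structure, the function name and the threshold are hypothetical, and `now` is assumed to come from a free-running counter that does not depend on the interrupt being monitored (a CPU cycle counter, say).

```c
#include <stdbool.h>
#include <stdint.h>

struct irq_storm_detector {
    uint64_t window_start;  /* counter value when the current window began */
    uint32_t count;         /* interrupts seen in the current window */
    uint32_t threshold;     /* more than this many per window is a storm */
    uint64_t window_len;    /* window length, in counter ticks */
};

/* Call once per interrupt. Returns true when the IRQ has fired more than
 * 'threshold' times within the current window. */
static bool irq_storm_note(struct irq_storm_detector *d, uint64_t now)
{
    if (now - d->window_start >= d->window_len) {
        /* The window has expired: start a new one and count from zero. */
        d->window_start = now;
        d->count = 0;
    }
    return ++d->count > d->threshold;
}
```

When the detector trips, the handler could mask the IRQ, try the next polarity/trigger combination, and reset the window before unmasking it again.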

Another problem, and a more concerning one, is that AROS has a long-standing and very hard to trigger issue where it loses tasks. Again, it seems to only really occur during boot - and only under very specific conditions which I have yet to pinpoint. When it does, one of 2 things happens.

#1 When input.device is active, some interrupt occurs signalling a task. input.device resumes but it never switches away ... ever. I wrote a tool for Hyper-V that let me break into AROS when it hung and dump the task lists, and sure enough there were tasks ready to run with pending signals, but for some reason input.device wasn't letting go.

#2 Some interrupt occurs from a hardware input device, signalling "input.device" to run. The scheduler switches out the running task ... but input.device is nowhere to be found ...

Again, I've invested a massive amount of time and effort investigating this one so far, but to no avail. From the scheduler's point of view everything is working as it is meant to - but input.device has mysteriously been taken out of the picture.

Until I can get to the bottom of those 2 issues, I have very little interest in working on much else in AROS. I have an NVMe driver in the works, which is basically complete - but debugging it is becoming tedious due to the random bugs mentioned above. The more debug output that is enabled, the more likely AROS is to trigger one of them.

So that brings us to this next year. At least for the foreseeable future, I will continue to focus on addressing the boot/stability issues already mentioned - and any others identified in due course. I'm hoping I can get them addressed relatively soon, because we really want to try to get the RasPi build working again (I've recently fixed some build issues, so now there is just one problem remaining with linking the ROMs), get the toolchain/SDK side addressed, and move forward with fixing the non-system applications (particularly POSIX-like code) that seem to be unstable currently. All of the main AROS components themselves are running extremely stably - to the point that sometimes I find myself just launching them to reflect on how badly they behaved before - but until we have all our userland software running again I won't be happy.

... Let's hope this next year is a good one.


(*) - It is hard to claim they are even running ABI v0. A decision was made by the AROS developers long ago, before IcAROS or other distributions were started: due to the lack of any "real" ABI definition, and due to needing to make drastic, ugly changes to the code to support binary compatibility with m68k, we would need to make changes to AROS so that we could clarify what the ABI was and not end up with an ugly mess of code to support the new targets.

Because this would likely lead to binary compatibility issues, and other breakages to cater for proper usage of the ABI, it was decided to snapshot the repository at that time and call it ABI v0 (as in it has no ABI), and move forward under ABI v1.

So regardless of what nonsense is spread on forums in a childish manner - ABI v1 "is" AROS.

Now, unfortunately, to appease the community that had grown around distributions built using ABI v0, the whole reasoning for keeping ABI v0 distinctly separate was breached. Changes were backported that broke compatibility with the "existing" ABI v0 code anyhow, and not on one occasion - but almost as often as a new IcAROS release was put out. Why? So they could claim it was ABI v0 while using all the changes that, because they break compatibility, are meant to be part of the distinctly separate ABI v1. You can't make this stuff up.