1 (edited by superg 2022-12-04 04:20:18)

EDIT 12/03/2022:

Final Audio CD offset correction algorithm (technical, sketched in code after the list):
1. if there is one and only one possible disc write offset, applying which perfectly aligns audio silence (level: 0) ranges with TOC index 0 ranges, use it
2. else if there is non-zero data in lead-out and that data can be fully shifted out (left) without spanning non-zero data into lead-in, correct offset with a minimum shift required
3. else if there is non-zero data in lead-in and that data can be fully shifted out (right) without spanning non-zero data into lead-out, correct offset with a minimum shift required
4. else apply offset 0
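
For illustration only, here is a minimal sketch of that selection order in C++-like form. All names (perfect_offsets, leadout_shift_offset, leadin_shift_offset) are hypothetical and the inputs are assumed to be precomputed elsewhere; this is not redumper's actual code.

#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, precomputed inputs (all in 4-byte CD samples):
//   perfect_offsets      - every offset that aligns audio silence (level 0) with TOC index 0 ranges
//   leadout_shift_offset - offset (minimum left shift) that moves all non-zero lead-out data into
//                          the program area, set only if this doesn't push non-zero data into the lead-in
//   leadin_shift_offset  - offset (minimum right shift) that moves all non-zero lead-in data into
//                          the program area, set only if this doesn't push non-zero data into the lead-out
int32_t correct_offset(const std::vector<int32_t> &perfect_offsets,
                       std::optional<int32_t> leadout_shift_offset,
                       std::optional<int32_t> leadin_shift_offset)
{
    // 1. one and only one perfect offset: use it
    if(perfect_offsets.size() == 1)
        return perfect_offsets.front();

    // 2. non-zero lead-out data can be fully shifted out (left): minimum shift required
    if(leadout_shift_offset)
        return *leadout_shift_offset;

    // 3. non-zero lead-in data can be fully shifted out (right): minimum shift required
    if(leadin_shift_offset)
        return *leadin_shift_offset;

    // 4. otherwise keep offset 0
    return 0;
}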

Dump format specification notes:
Regardless of the applied offset, we always operate on the LBA range [ 0 .. lead-out ) when we perform a track split. This is also true for data discs. If, as a result of mastering, there is non-zero data either in the lead-in ( -inf .. 0 ) or the lead-out [ lead-out .. +inf ), it is preserved in separate files. For discs with data tracks, non-zero data means the descrambled data portion of a sector (fields such as sync, MSF, mode, subheader, and ECC/EDC are excluded). The resulting lead-in/lead-out files should be trimmed to the minimum non-zero data size, the lead-in from the front and the lead-out from the back, but they should remain sector-aligned (size divisible by 2352).
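
A minimal sketch of that trimming rule for a lead-out file, assuming the raw lead-out audio already sits in a byte buffer (names and buffer layout are illustrative, not the actual dump code); a lead-in file would be trimmed from the front instead:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t SECTOR_SIZE = 2352;

// Trim a lead-out buffer to the minimum non-zero size from the back,
// keeping the resulting file sector-aligned (size divisible by 2352).
std::vector<uint8_t> trim_leadout(const std::vector<uint8_t> &leadout)
{
    // locate the last non-zero byte
    size_t last_nonzero = leadout.size();
    while(last_nonzero > 0 && leadout[last_nonzero - 1] == 0)
        --last_nonzero;

    // fully silent lead-out: nothing to preserve, no file is created
    if(last_nonzero == 0)
        return {};

    // round up to the next sector boundary, never exceeding the buffer
    size_t size = std::min((last_nonzero + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE,
                           leadout.size());

    return std::vector<uint8_t>(leadout.begin(), leadout.begin() + size);
}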

Considerations for matching Audio CD with different write offsets
For each disc, the dumping software can generate a checksum/hash of the non-zero sample span of the data, e.g. aligned to the 4-byte CD sample size. Such a hash can be used for disc identification as well as for matching Audio CDs with different write offsets, as in those cases the hash would be the same. This applies to both audio and data discs. I suggest using a longer SHA-1 hash rather than CRC32, just to be future-proof; it's quite likely that we would get a couple of CRC32 collisions in the ~100K discs currently in the DB. As a side perk, it can also serve as a unique disc ID which can be easily looked up in the database, if such a capability is ever implemented at redump.org. In any case, I would like to have such a hash in Audio CD entries just for write offset matching.
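
As a rough sketch of how such a hash could be computed (illustrative only; sha1_hex stands in for any existing SHA-1 routine and is not a real redumper function):

#include <cstddef>
#include <cstdint>
#include <span>
#include <string>
#include <vector>

// Stand-in for any SHA-1 implementation returning a hex digest.
std::string sha1_hex(std::span<const uint8_t> data);

// Hash only the span between the first and the last non-zero 4-byte CD sample.
// Two pressings of the same audio that differ only by write offset carry the same
// non-zero span, so this hash matches for both.
std::string universal_audio_hash(const std::vector<uint8_t> &audio)
{
    constexpr size_t SAMPLE_SIZE = 4; // 16-bit left channel + 16-bit right channel

    auto nonzero = [&](size_t i)
    { return audio[i] || audio[i + 1] || audio[i + 2] || audio[i + 3]; };

    size_t first = 0;
    size_t last = audio.size() / SAMPLE_SIZE * SAMPLE_SIZE;
    while(first < last && !nonzero(first))
        first += SAMPLE_SIZE;
    while(last > first && !nonzero(last - SAMPLE_SIZE))
        last -= SAMPLE_SIZE;

    return sha1_hex(std::span<const uint8_t>(audio.data() + first, last - first));
}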

END OF EDIT, the rest of the information here is kept for reference and archival purposes.




This topic is to clarify and decide on how we manage disc write offset for audio discs.

The current status quo is that we always dump audio discs with offset 0, as there is no reliable reference point in the audio stream (in contrast to a data track) which can be used to determine the offset. This approach has a number of disadvantages, such as:
* shifted audio data in the pre-gap / lead-out which we don't currently preserve
* occasional imperfect track split which cuts into the middle of audio tracks, e.g. you hear a bit of the next track at the end of the current one or of the previous track at the beginning of the current one

I believe I solved both of these problems in redumper. Let me define some terminology first.
Perfect Audio Offset is a disc write offset which guarantees that no data is shifted into the lead-out and that the track split doesn't cut into the middle of a track.

Perfect Audio Offset implementation details
For a given audio disc, I build a silence map based on TOC/subchannel information; essentially it's the INDEX 00 CUE entries, which are almost always empty (silent). As a next step, I build a similar silence map based on the audio stream. Finally, for each offset within the constrained range [-150 .. lead-out], I try to line up these two maps in such a way that the TOC/subchannel-based one fits into the audio stream. If it fits, it's a perfect audio offset.
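
Conceptually, the fit test looks something like the following (a simplified sketch with made-up types, not redumper's actual implementation): for every candidate offset, each TOC/subchannel index 0 range, shifted by that offset, must fall entirely inside some silent range of the audio stream.

#include <cstdint>
#include <vector>

// Half-open range [start, end) in 4-byte CD samples.
struct Range { int32_t start; int32_t end; };

// True if every TOC-derived index 0 range, shifted by `offset`,
// lies inside some silent range of the audio stream.
bool fits(const std::vector<Range> &toc_silence,
          const std::vector<Range> &audio_silence, int32_t offset)
{
    for(auto const &t : toc_silence)
    {
        bool covered = false;
        for(auto const &a : audio_silence)
            if(t.start + offset >= a.start && t.end + offset <= a.end)
            {
                covered = true;
                break;
            }
        if(!covered)
            return false;
    }
    return true;
}

// Every offset in the constrained candidate range that fits is a "perfect audio offset".
// 588 samples per sector, so -150 sectors corresponds to -150 * 588 samples.
std::vector<int32_t> perfect_offsets(const std::vector<Range> &toc_silence,
                                     const std::vector<Range> &audio_silence,
                                     int32_t offset_min, int32_t offset_max)
{
    std::vector<int32_t> result;
    for(int32_t o = offset_min; o <= offset_max; ++o)
        if(fits(toc_silence, audio_silence, o))
            result.push_back(o);
    return result;
}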

The current audio offset logic
* favor an offset value from the perfect offset range if available
* if multiple perfect offset values are available (a range), try to shift data out of the pre-gap (if needed) while staying within the perfect range
* otherwise, if there is no data in the pre-gap, favor offset 0 if it belongs to the perfect range
* finally, if offset 0 doesn't belong to the perfect range, use the value closest to 0 within the perfect range
* if no perfect offset is available at all, try to shift data out of the lead-out and the pre-gap (only if we can get rid of the full pre-gap data), provided that doesn't lead to data loss

or pseudocode:

if(perfect_audio_offset_available)
{
    if(perfect_offset_single_value)
        use_offset_value();
    else if(perfect_offset_value_range)
    {
        if(data_in_pregap)
        {
            if(enough_space_to_get_rid_of_whole_data_in_pregap AND still_within_perfect_range)
                move_minimum_data_right();
        }
        else if(zero_offset_belongs_to_perfect_range)
            use_zero_offset();
        else
            choose_the_offset_closest_to_zero();
    }
}
else
{
    if(data_in_leadout)
        move_minimum_data_left();
    else if(data_in_pregap AND enough_space_to_get_rid_of_whole_data_in_pregap)
        move_minimum_data_right();
}

Pre-Gap notes
Based on the discs I own and the discs I have access to, for most audio discs the data in the pre-gap is not a result of a write offset but rather of the way the disc was mastered. The values are often close to silence but not zeroed, and sometimes it's part of a hidden track (HTOA). In cases like this, there is no way to move that data fully out of the pre-gap without shifting it into the lead-out, as there is not much space (it's common to have 1-2 seconds of pre-gap audio data, which is a lot). I can definitely say it's worth preserving by extending track 1 a fixed 150 sectors back in all such cases, with, optionally, marking that in the CUE? Before anyone says it's stupid, DIC already does a similar track 1 extension for CDi-Ready discs.

Statistics
With the new method, I redumped all my audio discs and sadikyo redumped some of his. I shared the new dumper version with Intothisworld and he shared it with ppltoast, but I have yet to get results from them. The current merged detailed statistics are available here:
https://docs.google.com/spreadsheets/d/ … sp=sharing

TL;DR
73 discs - match redump (offset 0 is one of the perfect offsets)
9 discs - no perfect offset found (no distinctive silence in index 0, or no index 0 entries at all), offset 0 used, so they match redump
7 discs - only one perfect offset found, and it's the true offset value the disc was mastered with; will require a DB update
27 discs - perfect offset range excludes 0; will require a DB update
3 discs - have pre-gap data which is impossible to fully shift out and isn't currently preserved; will require a DB update

Side Effects
The given method works really well for PSX GameShark Update discs and Jaguar discs without relying on magic numbers, and we get perfect splits there too.

More Side Effects
If we left-align (or right-align) the offset based on the perfect offset range, we will be getting matches for audio discs with the same data but a different write offset, such as:
http://redump.org/disc/77301

I would like to hear what you think of this; let's discuss.

superg wrote:

This topic is to clarify and decide on how we manage disc write offset for audio discs.

The current status quo is that we always dump audio discs with offset 0, as there is no reliable reference point in the audio stream (in contrast to a data track) which can be used to determine the offset. This approach has a number of disadvantages, such as:
* shifted audio data in the pre-gap / lead-out which we don't currently preserve

Wrong, dumpers must tweak the offset value to include all the data if possible; also, DIC should do this automatically, IIRC. Sometimes both the pregap and the leadout have data, which needs manual analysis (usually it's about tweaking the offset to include the leadout data + extracting the pregap, like it's done in http://redump.org/disc/6695/, but I guess it depends on which of the pregap and leadout portions of data is smaller - the smaller one needs the offset tweaking, the larger one needs the extraction).

superg wrote:

* occasional imperfect track split which cuts into the middle of audio tracks, e.g. you hear a bit of the next track at the end of the current one or of the previous track at the beginning of the current one

There are thousands of games with the same problem, so it's mostly about their bad mastering rather than direct offset issues.

superg wrote:

Perfect Audio Offset implementation details
For a given audio disc, I build a silence map based on TOC/subchannel information; essentially it's the INDEX 00 CUE entries, which are almost always empty (silent). As a next step, I build a similar silence map based on the audio stream. Finally, for each offset within the constrained range [-150 .. lead-out], I try to line up these two maps in such a way that the TOC/subchannel-based one fits into the audio stream. If it fits, it's a perfect audio offset.

Crazy. I think that matching EAC and AccurateRip databases for the 'offset 0' audio tracks is a better benefit compared to some custom aligning. Also, what happens if all the tracks have big portions of silence at their beginnings and endings? Does it shift the offset to force the first track to start from its first nonzero byte?

superg wrote:

Side Effects
The given method works really well for PSX GameShark Update discs and Jaguar discs without relying on magic numbers, and we get perfect splits there too.

It will NOT work for Jaguar discs at all, because their 'data' (the INDEX 01 of the second track of the second session) must start with its magic (according to the official Jaguar CD format description documents). But if you align according to the magic, the first track of the first session for a good half of Jaguar CD releases gets shifted into the first pregap - and in your case, if you align according to the non-zero data and silences, the magic will be way off its proper position.

Jackal wrote:

The logic as described sounds good to me. I'm assuming the offset correction is always in samples, so a multiple of 4 bytes?

Yes, I analyze it as multiples of 4 bytes (two signed 16-bit sample values).

Jackal wrote:

I would also like to know if your method is able to find the true "perfect" offset for discs such as these: http://redump.org/discs/quicksearch/wri … /audio-cd/ + http://redump.org/discs/quicksearch/det … region/Eu/ where aligning either left or right to the first byte gives a "common" write offset value that is consistent with the ringcode.

I don't have any from this list, unfortunately. You can try redumper if you have any of these; for each audio disc it will output something like:

detecting offset
audio silence detection... done
perfect audio offset (silence level: 5): [+736 .. +22675]
disc write offset: +736
warning: pre-gap audio contains non-zero data, preserving (session: 1, leading zeroes: 87473, non-zeroes: 727/88200)
detection complete (time: 75s)

And after that "redumper split --force-offset=22675 --overwrite" to align it.

Jackal wrote:

Maybe this should be an extra step? So validate for common offset values when shifted to first or last non-zero byte before going with 0 offset or whatever else the logic would decide on. Would be great if you could test that using some of the images.

I also thought about something like this: left-shift (or right-shift) to the first non-zero byte, calculate some hash, and use it as a universal checksum which allows us to match the same audio discs mastered with different offsets. For instance, for the two such discs from here: http://redump.org/disc/77301, if I right-align, both dumps match.

F1ReB4LL wrote:

Wrong, dumpers must tweak the offset value to include all the data if possible, also DIC should do this automatically, IIRC.

True, we want dumpers to do that, but the status quo is that they really don't. They do whatever DIC does by default.

F1ReB4LL wrote:

Sometimes both pregap and leadout have data, needs manual analysis in this case (usually it's about tweaking the offset to include the leadout data + extracting the pregap, like it's done in http://redump.org/disc/6695/, but I guess it depends on which of the pregap and the leadout portions of data is smaller - the smaller one needs the offset tweaking, the larger one needs the extraction).

Even if we separately extract pregap.bin, it can be automated. I don't necessarily like this approach, as pregap data usually belongs to track 1. It's fine if it's "almost silence", but Intothisworld owns a disc where there's an HTOA in there.

F1ReB4LL wrote:

There are thousands of games with the same problem, so it's mostly about their bad mastering rather than direct offset issues.

True, but at least we have a way for them to determine a proper offset value using data track.

F1ReB4LL wrote:

Crazy. I think that matching EAC and AccurateRip databases for the 'offset 0' audio tracks is a better benefit compared to some custom aligning.

We can still have hashes or some other info calculated for offset 0 to satisfy compatibility with things like EAC/AccurateRip.

F1ReB4LL wrote:

Also, what happens if all the tracks have big portions of silence at their beginnings and endings? Does it shift the offset to force the first track to start from its first nonzero byte?

No no, it's not like that. The current logic doesn't align to the first non-zero sample. I calculate the perfect offset range, but if offset 0 belongs to that range, it has a higher priority. The disc you describe, with big portions of zeroes at the beginning and end, will have a big slack which will most likely include offset 0.

F1ReB4LL wrote:

It will NOT work for Jaguar discs at all, because its 'data' (the INDEX 01 of the second track of the second session) must start with its magic (according to the official Jaguar CD format description documents), but if you align according to the magic, the first track of the first session for a good half of Jaguar CD releases gets shifted to the first pregap - and in your case, if you align according to the non-zero data and silences, its magic will be way off its proper position.

It works for the Myst Demo I own. I can't speak for the generic case. My Myst Demo aligns to that magic using the logic I described:

detecting offset
audio silence detection... done
perfect audio offset (silence level: 0): [-2201 .. -401]
disc write offset: -401
warning: pre-gap audio is incomplete (session: 2, errors: 371)
detection complete (time: 14s)

As offset 0 is not included in the perfect range here, -401 is closest to 0, so it gets chosen, and that perfectly coincides with aligning by the magic.
By default I have it implemented using the magic, like DIC does. I just remember you mentioned you're not fond of the current way, so if you think we can do something better for Jaguar CD - let me know.

But the algo is still useful for PSXGS, as aligning by the magic there cuts through other data portions.

Jackal wrote:

So my question for this discussion is:
Do we really need a "perfect offset" correction if there is no data loss with 0 offset (and no common write offset can be detected)? After all, we won't know how the disc was mastered and whether the gaps are supposed to be silent.

Yeah, this is a legit question: if no data is lost, we just end up with an imperfect split sometimes. But that can technically be corrected from the BINs alone if needed.

Ok, so some cool down period passed, let's regroup.

Let's say we take the perfect track split out of consideration. I'll have redumper output the perfect audio offset range anyway, just for reference, but it will not be applied by default. The concept would be super useful for perfectionist audio folks, so I'm happy I have that implementation for them.

What's basically left for us @redump is to define a clear approach to how we handle audio discs with non zero pre-gap and lead-out data.

I guess we have a consensus on the following two rules (a small sketch of the minimum-shift computation follows the list):
1. if there is non-zero data in the lead-out and that data can be fully shifted out of there (left) without spanning non-zero data into the pre-gap, correct the offset with the minimum shift required
2. if there is non-zero data in the pre-gap and that data can be fully shifted out of there (right) without spanning non-zero data into the lead-out, correct the offset with the minimum shift required
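
As an illustration of "minimum shift required" for rule 1, a sketch under simplifying assumptions: the audio stream is addressed in 4-byte samples starting at LBA 0, and first_nonzero, last_nonzero and leadout_start are hypothetical precomputed values (rule 2 is the mirror image of this).

#include <cstdint>
#include <optional>

// first_nonzero / last_nonzero - sample indices of the first and last non-zero samples,
// counted from the start of the program area (LBA 0); leadout_start - first lead-out sample.
// Returns how many samples the data must be shifted left to clear the lead-out,
// or nullopt if that shift would push non-zero data out past the front (into the pre-gap).
std::optional<int32_t> leadout_left_shift(int32_t first_nonzero, int32_t last_nonzero,
                                          int32_t leadout_start)
{
    // no non-zero data in the lead-out, nothing to shift
    if(last_nonzero < leadout_start)
        return 0;

    // minimum shift that places the last non-zero sample just inside the program area
    int32_t shift = last_nonzero - leadout_start + 1;

    // shifting would span non-zero data past the front of the program area
    if(first_nonzero - shift < 0)
        return std::nullopt;

    return shift;
}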

How do we handle situations where it's impossible to shift the data out of the lead-out/pre-gap (the data is wider than the TOC space allocated for it)? Several options:
1. Use offset 0 as a base, dump non-zero pre-gap data to pregap.bin, dump non-zero lead-out data to leadout.bin
2. Fully shift data out of the lead-out if needed, dump non-zero pre-gap data to pregap.bin
3. Use offset 0 as a base, prepend non-zero pre-gap data to the first track, append non-zero lead-out data to the last track
4. Fully shift data out of the lead-out if needed, prepend non-zero pre-gap data to the first track

My insight:
I don't like (1) or (2) because data preserved in external pregap.bin and leadout.bin files will usually be "lost": it's unreferenced from the CUE-sheet and I don't see a good way of linking the files to the rest of the cue/bin set.
I also don't like (1) or (2) because in all cases that I saw, non-zero pre-gap data genuinely belongs to the first track. It's either an HTOA index 0 entry or non-zeroed mastering "silence" which is still part of the first track.
Lastly, I didn't find proof anywhere in the Red Book standard that the pre-gap data of an audio disc should be zeroed. I found such a requirement only for the pre-gap of a data track, in ECMA-130.

That said, I personally would lean towards (4).

Let me know what you think.

7 (edited by superg 2022-07-15 22:57:07)

Jackal wrote:

And where does this discussion leave us with discs like those PSX ones with audio in the lead-out? I'm against appending lead-out data to the last track, because it's just not part of the main tracks. Also, I don't think we should shift audio data out of the lead-out for mixed mode discs, because the combined offset correction overrules it. So the only solution for such discs imho is to put the data in leadout.bin or do nothing with it.

I totally agree with that. I purposely haven't mentioned that yet, to get Fireball's opinion on pregap.bin/leadout.bin and to focus on one issue at a time ;)
In all situations where we have an offset determined by a data track, we shouldn't extend the last track; saving the leadout separately for cases like this would possibly be the best solution?
TL;DR: shifting data out of the lead-out / pre-gap should happen only for discs where we can't figure out the offset in a deterministic way (based on data track sync/MSF or anything similar).

I can share that, at least conceptually, I definitely agree with superg's general direction and what he is trying to accomplish here with a solution for audio CDs.

As for some of the specific concerns raised, some of it is a bit outside the scope of my technical knowledge, so it's hard for me to provide valuable input.

I'm very curious to hear what Fireball has to say about the recent discussion. I do think that discussing and addressing this before too long is pretty important. As someone else said as well, I know some people are holding back on preserving / submitting audio discs because they know there are pending issues that need fixing.

Thanks to all of you for your amazing work on this, and especially superg, for digging deep into these issues and bringing this to the forefront.

Just checking to see if there are any thoughts, concerns, or updates about the audio CD issue? I know this got put on the back burner a bit due to the site concerns, of course.

Do we have any sort of consensus on how to proceed here?

I want to try some CDs suggested by Fireball to see if we're covered there.

Just an update on this: I ordered a couple more Japanese audio CDs which have audio in the pregap/leadout, which Fireball suggested checking.
They are on the way here. After I receive and redump them, I will report my findings here and we will iterate on the final audio CD rules and preservation format, as I really want this finalized.

OK, here's another update based on the discs I purchased and dumped myself. Checking these was originally suggested by Fireball, as he has experience with them, but they are really good examples of shitty audio use cases and give some general info on what we can encounter.

Dracula: Music Collection
http://redump.org/disc/14890/

Nothing special about this disc other than that the two masterings differ by offset; the same happens here: http://redump.org/disc/77301/
The specified possible write offsets +390 and +684 cut into tracks, so I don't think they are relevant. The true offset should be in the perfect range [-4926 .. -3731], based on my redumper algo.

We can offset-match such discs using two possible approaches:
1. Always shift each dump left-most or right-most, so that regardless of the offset, the dumps will match each other. The pro is that it's totally automatic. A big con is that we would need to redump all audio entries, which is unrealistic.
2. Introduce something I'd call a "universal checksum". Basically, upon a successful dump, redumper calculates checksums of the right-most (or left-most) shifted data and outputs crc/md5/sha-1 the usual way: <rom offset="+123" size="68701920" crc="060bb712" md5="47393f188ff00fafbdf77b5eb771dbd3" sha1="ef991d90b284b0c92ab2b4eb0eb77942e32bb98c" />, and notes the offset value needed for the right-most/left-most shift. We store this information somewhere for future reference. Every time a potential different-offset verification title is dumped, we compare universal checksums, and if they match, we add another ringcode line to the matched entry with the offset deduced in relation to the previous entry.
The pro of this approach is that, compared to method (1), we don't dramatically change the way we dump, so already-added dumps stay the way they are. The con is that it's not 100% automatic.

Personally, I'm in for (2), this is easy to implement and we can set a precedent that will be used in the audio dumping world.


Tenbu Mega CD Special Mini Audio CD
http://redump.org/disc/6695/

This one is very clean: redumper shifts 13 samples left out of the lead-out and everything still fits into the pre-gap nicely.


Micronet Music Collection Vol. 1
http://redump.org/disc/30335/

This has a huge non-zero chunk (22006 samples, or 88024 bytes) in the lead-out. According to the proposed rules, we shift the data out of there to the left by 22006 samples. This gets rid of the lead-out data but spills 16 non-zero samples over into the pre-gap. Not ideal, but it's close to the truth, and the perfect range for this disc is [+8423 .. +21155]. IMO it's the best solution, given that we preserve the whole data in one file.


Oyaji Hunter Mahjong
http://redump.org/disc/39873/

This is exactly as the comments say. Actually, I stand corrected, it's even more horrible: there are 68 sectors of data in the lead-out, 150 sectors of data in the pre-gap, and ~670 sectors (1574524 bytes) of non-zero data in the TOC area before the pre-gap. I capture everything in redumper and it seems to be consistent in the scram file. Offset 0 is used by default, I extract leadout.bin as is, and I'm getting the same checksums calculated by Fireball; everything matches. The 150 sectors of pre-gap data are fully preserved in Track 1, but what to do with the data in the TOC? I don't know. Well, in fact I will propose a solution later, but that requires everybody to be open-minded :)


Other Considerations
Now, with all these examples in mind, I have a modified idea which lets us capture every byte and remain mostly redump-compatible (including the site and the current DB).
What if we never shift the audio, i.e. always use offset 0, but store the spillover lead-in and lead-out data in separate tracks? Something like the pregap.bin / leadout.bin that we don't currently "preserve", but in a more generalized way.
This fits the lead-out in a very elegant way, as internally the lead-out is just another disc track with the AA track number, and it has all the track properties such as mode, data, positional subchannel, etc. As in reality the lead-out track spans the rest of the disc, we trim all the zeroed data and make it sector-aligned. If there is no data in the lead-out, we don't create a file, and that satisfies 99% of all use cases; at the same time we accommodate the case where there is something there. Two other big benefits I see are that we can preserve the Dreamcast logo data, which lives in the session 1 lead-out, and that I sometimes see lead-out audio spillover on PSX discs where it's not currently being preserved in any way. We can have the track defined in the CUE-sheet with all the appropriate properties, so this data will be preserved by "data hungry" preservationists, whoever they are. A similar approach goes for a non-zeroed lead-in track: if it's empty, as it usually is in 99% of cases, it won't exist; if it isn't, it's zero-trimmed at the front and sector-aligned. No data is ever lost, redump track compatibility is at an all-time high as everything is tied to the CUE, and we add it to the website like a usual track list with hashes.


Oyaji Hunter Mahjong example:

FILE "Oyaji Hunter Mahjong (Japan) (3DO Game Bundle) (Track 1#00).bin" BINARY
  REM REDUMP LEADIN
  TRACK 00 AUDIO
    INDEX 00 00:00:00
FILE "Oyaji Hunter Mahjong (Japan) (3DO Game Bundle) (Track 1).bin" BINARY
  TRACK 01 AUDIO
    INDEX 01 00:00:00
FILE "Oyaji Hunter Mahjong (Japan) (3DO Game Bundle) (Track 2).bin" BINARY
  TRACK 02 AUDIO
    INDEX 00 00:00:00
    INDEX 01 00:12:45
FILE "Oyaji Hunter Mahjong (Japan) (3DO Game Bundle) (Track 3).bin" BINARY
  TRACK 03 AUDIO
    INDEX 00 00:00:00
    INDEX 01 00:09:60
FILE "Oyaji Hunter Mahjong (Japan) (3DO Game Bundle) (Track 4).bin" BINARY
  TRACK 04 AUDIO
    INDEX 00 00:00:00
    INDEX 01 00:11:63
FILE "Oyaji Hunter Mahjong (Japan) (3DO Game Bundle) (Track 4@AA).bin" BINARY
  REM REDUMP LEADOUT
  TRACK 05 AUDIO
    INDEX 01 00:00:00

Or variations of this naming/numbering scheme. I specifically chose # and @ for the filenames as these symbols sort before and after the number entry, so you get a nice look, and the scheme supports multisession pre-gaps/lead-outs as we don't have to renumber anything (see the small sort check below).
We could simply use "Track 00" for the lead-in and "Track 05" for the lead-out, but there has to be a good way of supporting this for multisession discs, where there can be a session lead-out/lead-in between two tracks with adjacent numbers.
Or we don't have to add it to the CUE-sheet at all, but in my opinion having it there ties all the files together for preservation. We could even have special redump CUE tags for that; plenty of ways.
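
To illustrate the sorting claim: in ASCII, '#' (0x23) sorts before ')' (0x29) and the digits, while '@' (0x40) sorts after them, so the lead-in file lands directly before its track and the lead-out file directly after it. A tiny check, using shortened versions of the example filenames above:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Byte-wise string sort places '#' (0x23) before ')' (0x29) and the digits,
// and '@' (0x40) after them, so "Track 1#00" ends up directly before "Track 1"
// and "Track 4@AA" directly after "Track 4".
int main()
{
    std::vector<std::string> names{
        "Track 4@AA).bin", "Track 1).bin", "Track 1#00).bin",
        "Track 2).bin",    "Track 3).bin", "Track 4).bin"};

    std::sort(names.begin(), names.end());

    for(auto const &n : names)
        std::cout << n << '\n';
    // Track 1#00).bin
    // Track 1).bin
    // Track 2).bin
    // Track 3).bin
    // Track 4).bin
    // Track 4@AA).bin
}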

13 (edited by Jackal 2022-11-26 12:11:02)

My final vote would be to correct offset whenever it's necessary, practical and possible, so follow the base rules that we previously discussed:

0. If there is no non zero data in pregap/lead-out, use 0 offset. Unless it's possible to manually detect the write offset with a reasonable degree of certainty, in which case combined offset correction can be used.

1. if there is non zero data in lead-out and that data can be fully shifted out of there (left) without spanning non zero data into pre-gap, correct offset with a minimum shift required
2. if there is non zero data in pre-gap and that data can be fully shifted out of there (right) without spanning non zero data into lead-out, correct offset with a minimum shift required

Whenever a disc is dumped with offset correction, this should be documented in comments.

And then for the rare headache cases discussed in your last post where it's impossible to shift out data from lead-out/pre-gap (data is wider than allocated TOC space for it):

3. Use 0 offset and preserve relevant non-zero data in separate pregap.bin or leadout.bin. I don't see any advantage in trying to include this data with the main dump through a custom cuesheet format or whatever, but if it's decided otherwise, that's fine by me.

And for the DC / PSX or other discs that have missing relevant TOC / pre-gap / lead-out data, we should also preserve this data in separate files (offset corrected if possible).

As for offset matching and "universal" checksums: Audio checksum databases like AccurateRip and CUETools are already ignoring leading and trailing zero bytes, so they are essentially already storing "universal" checksums? I think this is beyond the scope of the Redump project and would require too much work and too many changes.

Guess we still need to figure out how to add the separate files to the database, with iR0b0t not around. Maybe resort to storing a .dat or checksums in comments for now, similar to Xbox PFI/DMI/SS.

Jackal wrote:

0. If there is no non zero data in pregap/lead-out, use 0 offset. Unless it's possible to manually detect the write offset with a reasonable degree of certainty, in which case combined offset correction can be used.

1. if there is non zero data in lead-out and that data can be fully shifted out of there (left) without spanning non zero data into pre-gap, correct offset with a minimum shift required

2. if there is non zero data in pre-gap and that data can be fully shifted out of there (right) without spanning non zero data into lead-out, correct offset with a minimum shift required

This is clear.


Jackal wrote:

Whenever a disc is dumped with offset correction, this should be documented in comments.

The non-zero offset will be specified in the ringcode entry, wouldn't that be enough?


Jackal wrote:

And then for the rare headache cases discussed in your last post where it's impossible to shift out data from lead-out/pre-gap (data is wider than allocated TOC space for it):

3. Use 0 offset and preserve relevant non-zero data in separate pregap.bin or leadout.bin. I don't see any advantage in trying to include this data with the main dump through a custom cuesheet format or whatever, but if it's decided otherwise, that's fine by me.

Yes, now I think this would be the best course of action. Separate files, size is sector aligned.


Jackal wrote:

And for the DC / PSX or other discs that have missing relevant TOC / pre-gap / lead-out data, we should also preserve this data in separate files (offset corrected if possible).

I already have this implemented in redumper, just have to walk over it and do some checks.


Jackal wrote:

As for offset matching and "universal" checksums: Audio checksum databases like AccurateRip and CUETools are already ignoring leading and trailing zero bytes, so they are essentially already storing "universal" checksums? I think this is beyond the scope of the Redump project and would require too much work and too many changes.

AccurateRip and CUETools are track-based, and as far as I know they do it mainly to match tracks - this is overkill for us.
What I was saying is not exactly that. In redumper, for audio CDs only, I can generate, let's say, a SHA-1 hash of the non-zero data span - one hash per disc. That would be in the log file. For a new submission of an audio disc, we add it to the comments, for example:
Universal Hash: 958b5a41456c90cb5c6df8d676c3d2d90a940609 (-647)
For subsequent verifications of the same disc with a different write offset, these hashes will match, and that will be an indicator to us not to add a new disc but to add another ringcode line to the existing entry. Just don't tell me we have too many things in comments (we do), but out of all the stored and unneeded crap like volume labels, this particular thing would be the most useful.


Jackal wrote:

Guess we still need to figure out how add the separate files in the database, with iR0b0t not around. Maybe resort to storing .dat or checksums in comments for now, similar to Xbox PFI/DMI/SS.

By the way, can we add extra files to XML list but exclude them from CUE-sheet, would that work?

I updated the first post.

I don't really have much to add to this I suppose, although thanks for moving it to general. Very interesting to follow your discussion on the topic. You guys have already done most of the hashing out it seems like, so I'm kind of in a "too little, too late" sort of position. But I do still have a fairly sizable unshared spreadsheet where I've documented the majority of my testing and overall work with this problem. Seems a waste of a lot of energy and hours to not at least make it available somewhere where it could potentially be useful, rather than just letting it rot away in my Google Drive.

But I don't want it to get lost in the shuffle of making site updates with iRobot, because I think it does deserve a cursory perusal at the very least. So I'll wait to post the actual document, but I want to give this notice of my interest in the forum post now, so the topic doesn't drift completely out of mind. Thank you. Cheers.

I think I already have your spreadsheet Into. Unless you added a lot since. Post the link here anyways.

18 (edited by Intothisworld 2022-12-17 23:31:26)

Yeah I did send you a link at one point. But you were pretty busy with redumper. I didn't have much of a chance to talk with you about it in any real nuanced or in-depth way like I was hoping to. And yeah there's probably been a lot added since then. It's been for the most part a perpetual work in progress since around April.

Anyway, I wasn't going to, but I think screw it, I'm just going to go over my general thinking process on the topic, and see what your guys' thoughts are in return. I've touched on these things to a certain extent with sadikyo, bikerspade, and superg like I said, but I'm very curious what input others like Jackal and F1ReB4LL might have as well.

First, just a quick general explanation of the spreadsheet: I started it originally to keep track of all the audio CDs I was finding that had non-zero data in the lead-in and lead-out, in order to have plenty of data and test cases to work from once testing started in earnest on putting audio CD offset auto-detection into practice. But one thing that kept bothering me was how extreme these supposed "offset" values were on some discs. The highest of these values were throwing off the track alignments on some of my CDs by up to 2 seconds... I don't think it's physically possible even, for glass mastering equipment to offset CD data by that much... The most likely explanation for these huge amounts of overflow data, as far as I can tell, is audio data being improperly trimmed by the audio mastering engineer at the studio, coupled with negligent red book compliance screening at the manufacturing facility.

But at the very least, it's evident there's some difference in origin & nature happening there, and it's just a matter then of figuring out a reliable way to distinguish between these two types of overflow data: One being overflow data due to standard, run-of-the-mill manufacturing offset, and the other overflow data due to sloppy mastering. So anyway, working on the spreadsheet, with that question in mind, I started just collecting data on all the discs I could, hoping that some useful patterns might present themselves.

I also, to establish some context for the problem, started keeping track of other previously confirmed offset values (from data track-based discs), along with those discs' ringcodes (with a focus on mastering SIDs). The correlation between these two factors is obviously not consistent enough to ever make any conclusive judgments from, but my thinking was that, at the very least, maybe this data could be used as a "quality check" tool of some kind. So when we encounter one of these extreme -40,000 or whatever "offsets," we could simply ask: "Okay, is there a basis for this offset in question in the realities of the manufacturing process that led to the creation of this disc?" i.e. "Does the LBR that created the glass master for this disc have any history of creating any other glass masters with this same strange offset?" With the very limited evidence available from the data contained in the audio CDs themselves, I thought that at the very least, this could be a very useful practical grounding for when we're approaching these types of very strange edge cases.

I also started inspecting CDs using superg's pregap "perfect offset" method and keeping track of the results from that. That was a huge revelation and as far as I know, is the only way to directly perceive on an audio CD itself what its original, true manufacturing offset was. The only downside is that it is unfortunately not readily visible on most audio CDs. It takes a very special arrangement of the data to be visible, and even when it is, oftentimes the evidence is not entirely clear-cut. But ultimately, I was able to use the pregap method to determine with reasonable confidence the true offset of about 10-15% of the CDs that I inspected. This data is all documented in the spreadsheet as well, including the track-by-track breakdown of each disc that I inspected that way.

There are also some other things recorded, such as PVDs for some data discs, notes on offset-related "alt" pressings (i.e. CDs that are identical to each other in all ways but offset) including the offset values that separate them, and various other bits of info.

Before I post the spreadsheet though, I want to first preface with explaining some of the primary concerns that have occurred to me as I've been putting all this data together.

The biggest thing that concerns me, I'll just say it plainly--and I don't mean it as a criticism towards anyone or anything, more just an observation of the ambiguity/difficulty of the problem--is that we have such a clean, and tight, and conclusive method for determining the original manufacturing offset of data track CDs, but then now that we come to audio CDs we may very well be left just resorting to somewhat of a "good enough" type of approach. There's so little evidence available to us to determine the true value for each disc, that it is almost justifiable to simply say, "well let's just shift what we can, capture all the data, and call it good."

The thing about that is though, with data track CDs we can of course determine the true offset directly and unambiguously, but even if we couldn't, no matter what the offset value we applied was (as far as I know, correct me if I'm wrong), it would still have no real tangible effect on the playback of the disc image itself; The file system is still accessed the same way and ultimately nothing in the user experience is changed.

With audio CDs on the other hand, when you adjust the offset between the audio data and the subcode data, it has a very direct and tangible effect on the playback of the CD. Namely, it changes the point at which the audio on the album starts, when it ends, as well as all the start and end points of the tracks in between, i.e. essentially it shifts the entire framework of the album. This is the type of thing that music collectors and enthusiasts are going to notice and care about when they're perusing our database, or listening to their favorite albums that they've dumped and preserved using our methods. Particularly those massive 20,000+ sample offsets, but even the smaller random values (e.g. -11, -17, etc.), being arbitrary like that, will likely irk many music purists and preservationists, if they can't be justified in any foundational way. Audio CD offsets are almost totally ambiguous like I said and as we well know, but I think that if anything, because of all those reasons, we should be being even more careful, even more restrained and discretionary than we are with data track CDs, when it comes to the types of offset values that we allow to be applied to them.

Couple that with the fact I mentioned earlier about some of those bigger overflow data values likely not even being a result of offset at all... Anyway I guess to sum up my basic point, in my opinion the number of samples that happen to be protruding from the program area is not justifiable evidence upon which to determine and correct for the manufacturer offset value, and due to the fact that the applied value can and does have a tangible effect on the accuracy of the playback experience that is preserved, we should be exercising all due restraint in making these types of changes to audio CD dumps.

I have a few more things to say in regards to this (and even a few ideas that might be workable to enhance our accuracy), but I don't want to bombard you with everything all at once, and I'd like to hear your thoughts on these specific concerns. I'll share the spreadsheet itself in my next post, but for now, thank you very much for reading. Cheers.

[EDIT: Got ignored. Well for posterity's sake, and so it doesn't go to waste, here's the spreadsheet. Audio CD test data, as well as sort of a rough draft of some other things I was working on. Maybe will come back and finish at some point just for fun.]

https://docs.google.com/spreadsheets/d/1Gknkby9nF3hW5CpVeVsPFJCn4gyADhLR8HF0LNRpgMU/edit?usp=sharing