For most CDs, unscrambled images are more useful on a day-to-day basis than scrambled images. Fortunately, an unscrambled image can be used to regenerate the scrambled data exactly or nearly exactly, with perhaps just a few KB of differences from the actual scrambled data returned by the drive. (Where intentional errors or incorrectly mastered sectors were replaced with dummy sectors during the initial descrambling step, the re-scrambled data will differ from the original scrambled data returned by the drive. For sectors without any errors, though, the re-scrambled data should exactly match the scrambled data received from the drive.)

For archival purposes, I'm storing both unscrambled and scrambled images for all the CDs I dump. This ends up taking quite a lot of storage, because each disc is stored both fully scrambled and fully unscrambled. What I'd ideally like to be able to do is just keep unscrambled images, and, alongside them, a difference file indicating what (if any) bytes differ when the unscrambled data is used to generate the scrambled data. This would enable a tremendous space savings since it would still enable full reconstruction of the scrambled data, but it would store only those bytes that cannot be regenerated from the unscrambled image.

Is there any existing software / image format that enables this type of storage? It seems like it would potentially be a nice feature for Aaru, though I don't believe it currently supports this. I've thought about maybe writing a utility to do it myself, but it'd feel much tidier if it tied into something that the community was already using for archival. Maybe if Aaru can't do it natively, it would be possible to add some custom metadata field to Aaru images that encodes any differing bytes?

Does anyone have any thoughts / insight? My motivation is that some discs seem to embed data inside of error sectors (e.g., sarami has pointed out previously that some disc has lines from the poem Jabberwocky stored in the erroneous sectors), and this data is thrown away when the descrambled image is built. I'd like to keep that data.

If you're using one of the latest DIC versions, it records both scrambled and descrambled image checksums in the "_disc.txt" file. You can scramble the descrambled image back and verify its checksum; if it matches, there's no reason to store the scrambled file itself.

F1ReB4LL wrote:

If you're using one of the latest DIC versions, it records both scrambled and descrambled image checksums in the "_disc.txt" file. You can scramble the descrambled image back and verify its checksum; if it matches, there's no reason to store the scrambled file itself.

I've thought about doing it this way and then just storing deltas for when the scrambled data doesn't match exactly (since the deltas would allow creation of the scrambled data from the unscrambled data and would typically be much smaller than the entire scrambled image). That's probably what I'll end up doing, but I also may look into adding metadata to an archival format like Aaru if it's possible.

I wanted to make sure there wasn't a better way to do it using some existing, standardized approach before I came up with my own solution. It sounds like there's not, unfortunately.

scsi_wuzzy wrote:
F1ReB4LL wrote:

If you're using one of the latest DIC versions, it records both scrambled and descrambled image checksums in the "_disc.txt" file. You can scramble the descrambled image back and verify its checksum; if it matches, there's no reason to store the scrambled file itself.

I've thought about doing it this way and then just storing deltas for when the scrambled data doesn't match exactly (since the deltas would allow creation of the scrambled data from the unscrambled data and would typically be much smaller than the entire scrambled image). That's probably what I'll end up doing, but I also may look into adding metadata to an archival format like Aaru if it's possible.

I wanted to make sure there wasn't a better way to do it using some existing, standardized approach before I came up with my own solution. It sounds like there's not, unfortunately.

Scrambling is simple math involving a shift register; it's trivial to implement.
A delta is ineffective here: it will be as big as the data track (data is scrambled, audio is unscrambled).
The most annoying part of this conversion is knowing which sectors are audio and which are data. The scm doesn't carry that info, so you have to extract it from the TOC to be absolutely sure (you can go by the data sync header, but there's no guarantee the same sequence won't appear in an audio sector).
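
For reference, here is a minimal Python sketch of that shift register, assuming the scrambler described in ECMA-130 Annex B (15-bit register, polynomial x^15 + x + 1, preset to 1, output taken least significant bit first, applied to bytes 12 through 2351 of each raw sector). The function names are made up for illustration and don't come from DIC or any other tool.

```python
SECTOR_SIZE = 2352   # raw CD sector size in bytes
SYNC_SIZE = 12       # the sync field is never scrambled


def build_scramble_table() -> bytes:
    """Return 2352 bytes to XOR with a raw data sector.

    The first 12 entries are zero, so XORing leaves the sync field untouched.
    """
    table = bytearray(SECTOR_SIZE)
    reg = 0x0001                         # 15-bit register, preset to 1
    for i in range(SYNC_SIZE, SECTOR_SIZE):
        value = 0
        for bit in range(8):             # output bits are taken LSB first
            value |= (reg & 1) << bit
            feedback = (reg ^ (reg >> 1)) & 1   # taps for x^15 + x + 1
            reg = (reg >> 1) | (feedback << 14)
        table[i] = value
    return bytes(table)


def scramble_sector(sector: bytes, table: bytes) -> bytes:
    """XOR one raw sector with the table; the same call also descrambles."""
    return bytes(a ^ b for a, b in zip(sector, table))
```

With this register the first scrambling bytes after the sync come out as 01 80 00 60, and because the operation is a plain XOR, running `scramble_sector` twice returns the original sector.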

5 (edited by scsi_wuzzy 2022-04-07 13:23:36)

superg wrote:

Scrambling is simple math involving a shift register; it's trivial to implement.
A delta is ineffective here: it will be as big as the data track (data is scrambled, audio is unscrambled).
The most annoying part of this conversion is knowing which sectors are audio and which are data. The scm doesn't carry that info, so you have to extract it from the TOC to be absolutely sure (you can go by the data sync header, but there's no guarantee the same sequence won't appear in an audio sector).

I didn't mean a delta between unscrambled and scrambled. What I meant is that it's not necessarily guaranteed that a re-scrambled data track will match the original scrambled one read from the disc. Specifically, error sectors in the descrambled image are replaced with dummy data, so, when those sectors are re-scrambled, you won't get the original scrambled data back.

Because of this, I was thinking of using deltas between the original scrambled data and the rescrambled data. For most any unprotected disc, the files will match exactly. For discs with errors, the rescrambled data for error sectors won't match the data returned by the drive, so the delta will just have to store the differences for these sectors.

I was thinking this would be a better solution than just storing both the unscrambled and scrambled data, as it would be much smaller. For many discs, it would basically eliminate the storage of the scrambled data altogether, since you can recreate it. For other discs, it would require just storing a delta large enough to reconstruct those sectors that don't rescramble to the original data, which, for most discs, is only a few hundred sectors at most.
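
A rough sketch of what such a delta could look like in Python, assuming both images are the same length and are compared in 2352-byte chunks. The dictionary layout and the function names are hypothetical; this is not an existing file format. Any shift between the two files (for example a drive offset) would have to be handled before this step.

```python
SECTOR_SIZE = 2352  # raw CD sector size in bytes


def make_delta(original: bytes, rescrambled: bytes) -> dict[int, bytes]:
    """Map sector index -> original bytes, for every sector that differs."""
    if len(original) != len(rescrambled):
        raise ValueError("images must be the same length")
    delta = {}
    for offset in range(0, len(original), SECTOR_SIZE):
        chunk = original[offset:offset + SECTOR_SIZE]
        if chunk != rescrambled[offset:offset + SECTOR_SIZE]:
            delta[offset // SECTOR_SIZE] = chunk
    return delta


def apply_delta(rescrambled: bytes, delta: dict[int, bytes]) -> bytes:
    """Rebuild the original scrambled image from the rescrambled one."""
    out = bytearray(rescrambled)
    for index, chunk in delta.items():
        start = index * SECTOR_SIZE
        out[start:start + len(chunk)] = chunk
    return bytes(out)
```

An empty delta means the scrambled image can be regenerated from the unscrambled one with nothing extra stored. A disc with a few hundred bad sectors would need a few hundred entries of 2352 bytes each, well under 1 MB.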

Regarding the TOC concerns, I was thinking of something even simpler. I don't necessarily care whether I know exactly which sectors, according to the TOC, are data or not. I just want a compact way to recreate the original scrambled data from the drive while storing the more useful unscrambled image for day-to-day use. I was thinking it should be possible to go through the unscrambled image looking for the 00 FF ... FF 00 sync pattern at each sector offset and then XOR every sector found that way with the scrambling table. If there's no sync, just write out the sector as-is. After that, compare this rescrambled image with the original scrambled image that DIC read in from the drive. If the images match, nothing new needs to be stored except a note that the scrambled image can be trivially recreated from the unscrambled image, and we can delete the scrambled image. If they don't match, make a delta between the rescrambled image and the original scrambled one read in from the drive, and then delete the original one. In the future, we can recreate the original by rescrambling the unscrambled image and then applying the delta.
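
A minimal Python sketch of the sync-based rescrambling step, under some assumptions: the image is a flat run of 2352-byte sectors that start at multiples of 2352, the sync field is the standard 12-byte pattern (00, ten FF bytes, 00), and the table comes from the ECMA-130 Annex B shift register. The names here are mine, not from DIC or Aaru. As noted earlier in the thread, an audio sector that happens to begin with the sync pattern would be scrambled by mistake.

```python
SECTOR_SIZE = 2352
SYNC = bytes([0x00] + [0xFF] * 10 + [0x00])  # 12-byte data sector sync field


def build_scramble_table() -> bytes:
    """XOR table for one raw sector (x^15 + x + 1, preset 1, LSB first)."""
    table = bytearray(SECTOR_SIZE)       # sync positions stay zero
    reg = 0x0001
    for i in range(len(SYNC), SECTOR_SIZE):
        value = 0
        for bit in range(8):
            value |= (reg & 1) << bit
            feedback = (reg ^ (reg >> 1)) & 1
            reg = (reg >> 1) | (feedback << 14)
        table[i] = value
    return bytes(table)


TABLE = build_scramble_table()


def rescramble_image(image: bytes, table: bytes = TABLE) -> bytes:
    """Scramble every sector that starts with SYNC; copy the rest as-is."""
    out = bytearray(image)
    # A trailing partial sector, if any, is left untouched.
    for offset in range(0, len(image) - SECTOR_SIZE + 1, SECTOR_SIZE):
        if image[offset:offset + len(SYNC)] == SYNC:
            for i in range(len(SYNC), SECTOR_SIZE):
                out[offset + i] ^= table[i]
    return bytes(out)


def matches_original(unscrambled: bytes, original_scrambled: bytes) -> bool:
    """True if the scrambled image can be regenerated with no delta at all."""
    return rescramble_image(unscrambled) == original_scrambled
```

If `matches_original` returns True, only the unscrambled image needs to be kept. Otherwise the rescrambled output is what the delta would be computed against.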

My real hope, though, was that Aaru or some other package would have the option to store both unscrambled and scrambled data simultaneously, using some kind of internal representation like the one I've described in order to save space. I don't think such a package exists, though.