
Why does data compression often cause data loss?

+2
−3

I understand data compression as making data structures nearer (if they are mere machine code without any abstract representation) or representing them in less and less abstract computer languages (from the "complicated" to the "simple").

If so:
Why does data compression often cause data loss?
I don't understand why either "removing gaps" to reorganize data better (as in defragmentation) or a simpler representation (simpler machine code in computer memory) would necessitate data loss.

I think that my question stands whether the data we work with is "archived" or "not archived".


Update note: Given the answers and the downvotes, I guess I misunderstood the little I have read about this topic, and confused it with defragmentation.



2 answers

+5
−0

Data compression uses a wide variety of tactics to reduce the storage needed for data, including (just off the top of my head):

  • Run length encoding - e.g., store aaaaa as ax5
  • Tokenizing commonly used items - e.g., a BASIC interpreter might store each keyword as a single byte, since the location/usage is unambiguous (determined by the language structure). The output may actually look different (e.g., UPPER vs. lower case) but functionally the result will be the same as the original.
  • Analysis to determine frequently used strings and assign special codes for those strings - this can get quite complex but can result in significant lossless compression across a wide variety of data files.
  • Stripping unused bits - e.g., a 7-bit ASCII file can be stored in 7/8 the space of the original, provided that 8th bit is always 0. The catch is that any character with the 8th bit set will need a special code to indicate it (and the special code will need a special code to indicate when it is that code as an actual character, etc.)

The above items are all, if done correctly, lossless compression.
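As an illustration, the first tactic (run-length encoding) can be sketched in a few lines of Python. This is a toy version for clarity, not a production codec; real RLE formats pack the pairs into bytes rather than a list of tuples:

```python
def rle_encode(s):
    """Collapse each run of a repeated character into a (char, count) pair."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))
        i = j
    return out

def rle_decode(pairs):
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * n for ch, n in pairs)

assert rle_encode("aaaaa") == [("a", 5)]
assert rle_decode(rle_encode("aaaaabbbc")) == "aaaaabbbc"  # lossless round trip
```

The round trip is exact, which is what makes this lossless: nothing is discarded, only re-described.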

There is another category of lossy compression. This is most often used with images, but could be applied to other data as well, though hopefully not to bank account balances. Lossy compression is typically based on combining things that look alike (e.g., several pixels that are all very close shades of blue - make them all one color so that they can be treated as a compressible block), removing insignificant (to the human eye) detail, or deliberately lowering resolution in a simple space vs. quality tradeoff. Probably the most commonly encountered example is JPEG.
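The "several pixels that are all very close shades" idea can be illustrated by quantizing channel values: nearby shades snap to the same value, which then compresses well, but the original shades are unrecoverable. This is a toy sketch in Python, not any real image codec:

```python
def quantize(values, step=32):
    """Snap each 0-255 channel value to the nearest multiple of `step`,
    clamped to 255. Nearby shades collapse to one value (lossy)."""
    return [min(255, round(v / step) * step) for v in values]

shades = [100, 101, 99, 102, 100]   # five nearly identical blues
print(quantize(shades))             # prints [96, 96, 96, 96, 96]
```

After quantizing, the run of identical values is ideal input for a lossless pass like run-length encoding, which is roughly how lossy codecs combine both kinds of compression.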

The bottom line is that lossless compression has limits. A typical example is that plain text data might compress 90% - e.g., 100k -> 10k with zip or a similar algorithm. Process that same file again, with the same or a similar algorithm, and the size will still be around 10k. If you could keep compressing indefinitely, every file would eventually compress down to a single bit, and storage manufacturers would be out of business.

Back to some terms in the original question:

making data structures nearer (if they are mere machine code without any abstract representation)

At the data structure level, it is sometimes possible to compress things. For example, using integers as references between data objects instead of human-readable text can make a big difference in storage space and in speed of access. That difference in speed of access can go both ways though: integers make for more compact index files, but on the other hand, the human-readable text has to be read from another file, resulting in an extra storage access.

representing them in less and less abstract computer languages (from the "complicated" to the "simple").

Not exactly languages. Generally speaking, any language (from C to Python to LISP, etc.) in the end boils down to simple assembly language. The issue here is the storage of the data, which is, largely, language independent.

either "removing gaps"

If done by run-length-encoding (or similar) of whitespace (in text) or null data (in a data file), then that should be lossless.

On the other hand, removing whitespace in an HTML file will produce no visual change in a browser but will result in differences when editing. Even more so, removing whitespace in a JavaScript file (minifying) results in a file that is much less human readable but smaller and faster to download. Both of these are, arguably, forms of lossy compression - they can't be automagically returned to their original, more human-readable, state.

re-organize data better (as in data defragmentation)

Defragmentation can make access faster but does not necessarily provide any data compression.

or a simpler representation (simpler machine code in computer memory) would necessitate data loss.

A simpler representation is, by definition, a lossy compression. Something must be taken away to make it simpler.


+5
−0

I understand data compression as making data structures nearer (if they are mere machine code without any abstract representation) or representing them in less and less abstract computer languages (from the "complicated" to the "simple").

Nonsense.

Let's turn to Wikipedia for a better definition:

In signal processing, data compression, source coding,[1] or bit-rate reduction is the process of encoding information using fewer bits than the original representation.

That is, the same information can be represented in different ways, which require a different amount of bits. By picking a representation with fewer bits, we can transmit the same information more cheaply.

For instance, consider the information contained in this answer. I could have sent it to you by holding my phone up to the screen, taking a picture, and emailing you that picture. Or I could have copied and pasted the text into the email. Either way, you receive the same information, but one email will be far smaller than the other and will be transmitted far more quickly.

Why does data compression often cause data loss?

Suppose I wanted to tell someone what your avatar looks like. I could do this by telling them, for each pixel, the exact rgb color of that pixel. That would take a long time, but preserve every detail.

Or I could say "It's a pink unicorn with rainbow hair in front of a reddish sky with two white clouds and greenish-gray ground". That's far shorter, and enough information to recognize your avatar, but not enough information to recreate it precisely. Or I could simply say "It's a pink unicorn". That's even shorter, and still enough information to distinguish your avatar from mine.

Put differently, the easiest way to compress data is to discard information that doesn't matter. And that's why the most efficient compression (particularly for audio and video) loses information.

But that is not the only way to compress data. I could have said "It's unicorn 36363 at unicornify.pictures". Then, anyone with access to the internet, or who knows the algorithm unicornify uses to turn 36363 into a picture of a unicorn, would be able to recreate your avatar perfectly. Sometimes, data is best described by the process that created it :-)

Even barring such special cases, a general-purpose compression algorithm may be able to exploit redundancy in the original message to shorten it. For instance, rather than saying

Her Triumphant Radiance, the Wisdom of the Storm, Duchess of the Seven Seas visited the lands of Duke Henry. After a lengthy stay, Her Triumphant Radiance, the Wisdom of the Storm, Duchess of the Seven Seas, traveled to Permbridge Hold, where Her Triumphant Radiance, the Wisdom of the Storm, Duchess of the Seven Seas visited with Lady Alnor.

You could transmit

Her Triumphant Radiance, the Wisdom of the Storm, Duchess of the Seven Seas (henceforth called HTR), visited the lands of Duke Henry. After a lengthy stay, HTR traveled to Permbridge Hold, where HTR visited with Lady Alnor.

This is far shorter, but it permits the original message to be recreated perfectly. Of course, such general compression only works if the initial message is redundant in a way the compression algorithm recognizes.
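The same substitution can be done mechanically; this is the essence of dictionary-based compression (a hedged Python sketch: real algorithms such as LZ77 discover the repeated phrases automatically rather than being told them):

```python
def abbreviate(text, phrase, token):
    """Replace every occurrence of a long repeated phrase with a short token.
    The token must not already occur in the text, or expansion would corrupt it."""
    assert token not in text
    return text.replace(phrase, token)

def expand(text, phrase, token):
    """Undo the substitution, recovering the original text exactly."""
    return text.replace(token, phrase)

title = "Her Triumphant Radiance, the Wisdom of the Storm, Duchess of the Seven Seas"
message = f"{title} visited Duke Henry. Later, {title} visited Lady Alnor."

short = abbreviate(message, title, "\x01")
assert expand(short, title, "\x01") == message   # perfectly recoverable
assert len(short) < len(message)
```

Because the substitution is reversible, this is lossless; the shared dictionary (here, the phrase and its token) is the only extra thing sender and receiver must agree on.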

