

Why does data compression often cause data loss?

+2
−3

I understand data compression as making data structures nearer (if they are mere machine code without any abstract representation) or representing them in less and less abstract computer languages (from the "complicated" to the "simple").

If so, why does data compression often cause data loss?
I don't understand why either "removing gaps" to re-organize data better (as in data defragmentation) or a simpler representation (simpler machine code in computer memory) would necessitate data loss.

I think that my question stands whether the data we work with is "archived" or "not archived".


Update note: Given the answers and the downvotes, I guess I misunderstood the little I have read about this topic, and confused it with defragmentation.



2 answers


+5
−0

Data compression uses a wide variety of tactics to reduce the storage needed for data, including (just off the top of my head):

  • Run length encoding - e.g., store aaaaa as ax5 (see the sketch after this list)
  • Tokenizing commonly used items - e.g., a BASIC interpreter might store each keyword as a single byte, since the location/usage is unambiguous (determined by the language structure). The output may actually look different (e.g., UPPER vs. lower case) but functionally the result will be the same as the original.
  • Analysis to determine frequently used strings and assign special codes for those strings - this can get quite complex but can result in significant lossless compression across a wide variety of data files.
  • Stripping unused bits - e.g., a 7-bit ASCII file can be stored in 7/8 the space of the original, provided that 8th bit is always 0. The catch is that any character with the 8th bit set will need a special code to indicate it (and the special code will need a special code to indicate when it is that code as an actual character, etc.)
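
Here is a minimal sketch of the run-length encoding idea from the first bullet, in Python. The ax5 output format simply mirrors the example above; real RLE codecs use a compact binary format and handle single characters more efficiently.

```python
# Toy run-length encoder: collapse each run of repeated characters into "<char>x<count>".
def rle_encode(text: str) -> str:
    if not text:
        return ""
    out = []
    prev, count = text[0], 1
    for ch in text[1:]:
        if ch == prev:
            count += 1
        else:
            out.append(f"{prev}x{count}")
            prev, count = ch, 1
    out.append(f"{prev}x{count}")
    return ",".join(out)

print(rle_encode("aaaaabbc"))  # ax5,bx2,cx1
```

Because every run is recorded exactly, a matching decoder can reconstruct the original string bit for bit, which is what makes this lossless.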

The above items are all, if done correctly, lossless compression.

There is another category: lossy compression. This is most often used with images, but could be applied to other data as well, though hopefully not to bank account balances. Lossy compression is typically based on combining things that look alike (e.g., several pixels that are all very close shades of blue - make them all one color so that they can be treated as a compressible block), removing insignificant (to the human eye) detail, or deliberately lowering resolution in a simple space vs. quality tradeoff. Probably the most commonly encountered example is JPEG.
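
To make the "combine things that look alike" idea concrete, here is a toy sketch (not how JPEG actually works) that rounds each RGB channel down to a coarser grid, so nearby shades collapse to a single value and compress better. The pixel values are made up for illustration.

```python
# Round each RGB channel down to a multiple of 32: similar shades become identical,
# which helps later compression stages, but the exact original values are gone.
def quantize(pixel, step=32):
    return tuple((channel // step) * step for channel in pixel)

pixels = [(30, 144, 255), (31, 143, 254), (28, 146, 250)]  # three very similar blues
print([quantize(p) for p in pixels])  # all three collapse to (0, 128, 224)
```

Once the three pixels have been merged, there is no way to tell which original shade each one was, which is exactly what "lossy" means.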

The bottom line is that lossless compression has limits. A typical example is that plain text data might compress 90% - e.g., 100k -> 10k with zip or a similar algorithm. Process that same file again, with the same or a similar algorithm, and the size will still be around 10k. If you could keep compressing indefinitely, every file would eventually compress down to a single bit, and storage manufacturers would be out of business.
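
You can see that limit yourself with a few lines of Python and its standard zlib module (DEFLATE, the same family of algorithms zip uses): compressing already-compressed data gains essentially nothing.

```python
import zlib

original = b"the quick brown fox jumps over the lazy dog\n" * 1000
once = zlib.compress(original)
twice = zlib.compress(once)

print(len(original))  # 44000 bytes of highly redundant text
print(len(once))      # a tiny fraction of that
print(len(twice))     # about the same as `once`, sometimes slightly larger
```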

Back to some terms in the original question:

making data structures nearer (if they are mere machine code without any abstract representation)

At the data structure level, it is sometimes possible to compress things. For example, using integers as references between data objects instead of human-readable text can make a big difference in storage space and in speed of access. That difference in speed of access can go both ways though: integers make for more compact index files, but on the other hand, the human-readable text has to be read from another file, resulting in an extra storage access.
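
As a rough sketch of that trade-off (the table and field names below are made up): each record stores a small integer key instead of repeating the full text, and reading the text back costs one extra lookup.

```python
# One shared lookup table instead of repeating the long text in every record.
COUNTRIES = {1: "United States of America", 2: "United Kingdom"}

orders = [
    {"order_id": 1001, "country_id": 1},
    {"order_id": 1002, "country_id": 2},
]

for order in orders:
    # Getting the human-readable name back requires the extra lookup mentioned above.
    print(order["order_id"], COUNTRIES[order["country_id"]])
```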

representing them in less and less abstract computer languages (from the "complicated" to the "simple").

Not exactly languages. Generally speaking, any language (from C to Python to LISP, etc.) in the end boils down to simple assembly language. The issue here is the storage of the data, which is, largely, language independent.

either "removing gaps"

If done by run-length-encoding (or similar) of whitespace (in text) or null data (in a data file), then that should be lossless.

On the other hand, removing whitespace in an HTML file will produce no visual change in a browser but will result in differences when editing. Even more so, removing whitespace in a Javascript file (minifying) results in a file that is much less human readable but loads faster. Both of these are, arguably, forms of lossy compression - they can't be automagically returned to their original, more human-readable, state.
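
A tiny illustration of why this is one-way (the HTML snippet is made up): the stripped version renders identically, but the original indentation cannot be reconstructed from it.

```python
html = "<ul>\n  <li>one</li>\n  <li>two</li>\n</ul>"
minified = "".join(line.strip() for line in html.splitlines())
print(minified)  # <ul><li>one</li><li>two</li></ul> -- the layout is gone for good
```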

re-organize data better (as in data defragmentation)

Defragmentation can make access faster but does not necessarily provide any data compression.

or a simpler representation (simpler machine code in computer memory) would necessitate data loss.

A simpler representation is, by definition, a lossy compression. Something must be taken away to make it simpler.

+5
−0

I understand data compression as making data structures nearer (if they are mere machine code without any abstract representation) or representing them in less and less abstract computer languages (from the "complicated" to the "simple").

Nonsense.

Let's turn to Wikipedia for a better definition:

In signal processing, data compression, source coding,[1] or bit-rate reduction is the process of encoding information using fewer bits than the original representation.

That is, the same information can be represented in different ways, which require a different amount of bits. By picking a representation with fewer bits, we can transmit the same information more cheaply.

For instance, consider the information contained in this answer. I could have sent it to you by holding my phone to the screen, taking a picture, and emailing you that picture. Or I could have copied and pasted the text into the email. Either way, you receive the same information, but one email will be far smaller than the other and be transmitted far more quickly.

Why does data compression often cause data loss?

Suppose I wanted to tell someone what your avatar looks like. I could do this by telling them, for each pixel, the exact rgb color of that pixel. That would take a long time, but preserve every detail.

Or I could say "It's a pink unicorn with rainbow hair in front of a reddish sky with two white clouds and greenish-gray ground". That's far shorter, and enough information to recognize your avatar, but not enough information to recreate it precisely. Or I could simply say "It's a pink unicorn". That's even shorter, and still enough information to distinguish your avatar from mine.

Put differently, the easiest way to compress data is to discard information that doesn't matter. And that's why the most efficient compression (particularly for audio and video) loses information.

But that is not the only way to compress data. I could have said "It's unicorn 36363 at unicornify.pictures". Then, anyone with access to the internet, or who knows the algorithm unicornify uses to turn 36363 into a picture of a unicorn, would be able to recreate your avatar perfectly. Sometimes, data is best described by the process that created it :-)

Even barring such special cases, a general purpose compression algorithm may be able to exploit redundancy in the original message to shorten it. For instance rather than saying

Her Triumphant Radiance, the Wisdom of the Storm, Duchess of the Seven Seas visited the lands of Duke Henry. After a lengthy stay, Her Triumphant Radiance, the Wisdom of the Storm, Duchess of the Seven Seas, traveled to Permbridge Hold, where Her Triumphant Radiance, the Wisdom of the Storm, Duchess of the Seven Seas visited with Lady Alnor.

You could transmit

Her Triumphant Radiance, the Wisdom of the Storm, Duchess of the Seven Seas (henceforth called HTR), visited the lands of Duke Henry. After a lengthy stay, HTR traveled to Permbridge Hold, where HTR visited with Lady Alnor.

This is far shorter, but permits the original message to be recreated perfectly. Of course, such general compression only works if the initial message is redundant in a way the compression algorithm recognizes.
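
For the curious, here is a hand-rolled Python version of that substitution, just to show it round-trips exactly. (A real algorithm would also have to cope with "HTR" already appearing in the input, and would discover the repeated phrase itself rather than being told about it.)

```python
TITLE = "Her Triumphant Radiance, the Wisdom of the Storm, Duchess of the Seven Seas"

def compress(text: str) -> str:
    return text.replace(TITLE, "HTR")

def decompress(text: str) -> str:
    return text.replace("HTR", TITLE)

message = f"{TITLE} visited the lands of Duke Henry. After a lengthy stay, {TITLE} traveled to Permbridge Hold."
assert decompress(compress(message)) == message  # reconstructed perfectly: lossless
```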
