*Data compression* uses a wide variety of tactics to reduce the storage needed for data, including (just off the top of my head):

* Run-length encoding - e.g., store *aaaaa* as *a*x5
* Tokenizing commonly used items - e.g., a BASIC interpreter might store each keyword as a single byte, since the location/usage is unambiguous (determined by the language structure). The output may actually look different (e.g., UPPER vs. lower case) but functionally the result will be the same as the original.
* Analysis to determine frequently used strings and assign special codes for those strings - this can get quite complex but can result in significant lossless compression across a wide variety of data files.
* Stripping unused bits - e.g., a 7-bit ASCII file can be stored in 7/8 the space of the original, provided that the 8th bit is always 0. The catch is that any character with the 8th bit set will need a special code to indicate it (and the special code will need a special code to indicate when it is that code as an actual character, etc.)

The above items are all, if done correctly, **lossless** compression.

There is another category: **lossy** compression. This is most often used with images, but could be applied to other data as well, though hopefully not to bank account balances. Lossy compression is typically based on combining things that look alike (e.g., several pixels that are all very close shades of blue - make them all one color so that they can be treated as a compressible block), removing detail that is insignificant to the human eye, or deliberately lowering resolution in a simple space vs. quality tradeoff. Probably the most commonly encountered example is [JPEG](https://en.wikipedia.org/wiki/JPEG).

The bottom line is that lossless compression has limits. A typical example is that plain text data might compress 90% - e.g., 100k -> 10k with zip or a similar algorithm.
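As a rough illustration of that limit, here is a minimal sketch using Python's standard `zlib` module (DEFLATE, the same family of algorithm that zip commonly uses). The word list, seed, and sample size are arbitrary assumptions made up for this demo, and real ratios depend heavily on the input, so don't expect exactly 90%. It compresses the data once, then feeds the compressed output back in for a second pass.

```python
import random
import zlib
from typing import Tuple

# Hypothetical sample data: pseudo-random "prose" drawn from a small vocabulary.
# The vocabulary and seed are arbitrary choices for this illustration only.
WORDS = ("data compression uses a wide variety of tactics to reduce the "
         "storage needed for text images and other files").split()


def make_sample_text(n_words: int = 20_000, seed: int = 0) -> bytes:
    """Build a reproducible block of space-separated words."""
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(n_words)).encode("ascii")


def compress_twice(data: bytes) -> Tuple[bytes, bytes]:
    """Return (first-pass output, second-pass output) of zlib compression."""
    once = zlib.compress(data, 9)
    twice = zlib.compress(once, 9)  # compress the already-compressed bytes
    return once, twice


if __name__ == "__main__":
    text = make_sample_text()
    once, twice = compress_twice(text)
    print(f"original:    {len(text)} bytes")
    print(f"first pass:  {len(once)} bytes")
    print(f"second pass: {len(twice)} bytes")
    # Lossless: the first pass can be reversed exactly.
    assert zlib.decompress(once) == text
```

The first pass should shrink the text substantially; the second pass has almost no redundancy left to exploit.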
Process that same file again, with the same or a similar algorithm, and the size will still be around 10k. If you could keep compressing indefinitely, every file would eventually compress down to a single bit, and storage manufacturers would be out of business.

Back to some terms in the original question:

> making data structures nearer (if they are mere machine code without any abstract representation)

At the data structure level, it is sometimes possible to compress things. For example, using integers as references between data objects instead of human-readable text can make a big difference in storage space and in speed of access. That difference in speed of access can go both ways though: integers make for more compact index files, but the human-readable text then has to be read from another file, resulting in an extra storage access.

> representing them in less and less abstract computer languages (from the "complicated" to the "simple").

Not exactly languages. Generally speaking, any language (from C to Python to LISP, etc.) in the end boils down to simple machine instructions. The issue here is the storage of the *data*, which is, largely, language independent.

> either "removing gaps"

If done by run-length encoding (or similar) of whitespace (in text) or null data (in a data file), then that should be lossless. On the other hand, removing whitespace in an HTML file will usually produce no visual change in a browser but will result in differences when editing. Even more so, removing whitespace in a JavaScript file (minifying) results in a file that is much less human readable but smaller, and therefore faster to download and parse. Both of these are, arguably, forms of lossy compression - they can't be automagically returned to their original, more human-readable, state.

> re-organize data better (as in data defragmentation)

Defragmentation can make access faster but does not necessarily provide any data compression.
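The "removing gaps" distinction above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: `rle_encode`, `rle_decode`, and `strip_gaps` are hypothetical helper names, and `strip_gaps` is a crude stand-in for minification (real minifiers do far more). Run-length encoding keeps enough information to rebuild the input exactly, while collapsing whitespace maps different inputs to the same output.

```python
import re
from itertools import groupby
from typing import List, Tuple


def rle_encode(text: str) -> List[Tuple[str, int]]:
    """Lossless: turn each run of identical characters into (char, count)."""
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(text)]


def rle_decode(pairs: List[Tuple[str, int]]) -> str:
    """Rebuild the original text from (char, count) pairs."""
    return "".join(ch * count for ch, count in pairs)


def strip_gaps(text: str) -> str:
    """Lossy: collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    spaced = "x   =   1;\n\n\ny = 2;"
    compact = "x = 1; y = 2;"

    # Run-length encoding round-trips exactly.
    assert rle_decode(rle_encode(spaced)) == spaced

    # Two different originals collapse to the same result,
    # so there is no way to tell which one you started from.
    assert strip_gaps(spaced) == strip_gaps(compact)
    print(rle_encode("aaaaa"))
```

Note that run-length encoding only saves space when runs are long; on text with few repeats, the (char, count) pairs can take more room than the original.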
> or a simpler representation (simpler machine code in computer memory) would necessitate data loss.

A simpler representation is, by definition, lossy compression. Something must be taken away to make it simpler.
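To make "something must be taken away" concrete, here is a small sketch in the spirit of the similar-shades-of-blue example earlier in this answer. The function names, the step size of 16, and the sample values are all assumptions chosen for illustration; this is not how JPEG or any particular codec works. Snapping nearby values to a common level makes the data simpler (one long run instead of many short ones) but discards the original differences.

```python
from itertools import groupby
from typing import List


def quantize(values: List[int], step: int = 16) -> List[int]:
    """Lossy: snap each value down to the nearest multiple of `step`."""
    return [(v // step) * step for v in values]


def count_runs(values: List[int]) -> int:
    """Count runs of consecutive identical values (fewer runs = more compressible)."""
    return sum(1 for _ in groupby(values))


if __name__ == "__main__":
    # Hypothetical row of pixel intensities: several very close shades.
    blues = [200, 201, 203, 202, 205, 204, 207, 206]
    simplified = quantize(blues)

    print("runs before:", count_runs(blues))
    print("runs after: ", count_runs(simplified))

    # Different originals become identical, so the step cannot be undone.
    assert quantize([200]) == quantize([207])
```

After quantizing, a run-length encoder has much more to work with, which is the basic trade being made: detail is given up in exchange for a more compressible representation.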