
In computer science, a succinct data structure is a data structure which uses an amount of space that is "close" to the information-theoretic lower bound, but (unlike other compressed representations) still allows for efficient query operations. The concept was originally introduced by Jacobson[1] to encode bit vectors, (unlabeled) trees, and planar graphs. Unlike general lossless data compression algorithms, succinct data structures retain the ability to use them in-place, without decompressing them first. A related notion is that of a compressed data structure, in which the size of the data structure depends upon the particular data being represented.

Suppose that Z is the information-theoretically optimal number of bits needed to store some data. A representation of this data is called:

  • implicit if it takes Z + O(1) bits of space,
  • succinct if it takes Z + o(Z) bits of space, and
  • compact if it takes O(Z) bits of space.

For example, a data structure that uses 2Z bits of storage is compact, Z + √Z bits is succinct, Z + lg Z bits is also succinct, and Z + 3 bits is implicit.

Implicit structures are thus usually reduced to storing information using some permutation of the input data; the best-known example of this is the heap.
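The heap illustrates the point: the keys occupy exactly n array slots, and the tree structure is encoded purely by index arithmetic rather than by stored pointers. A minimal sketch using Python's standard `heapq` module:

```python
import heapq

# A binary heap is "implicit": the n keys are stored as a permutation of
# the input in a plain length-n array, with no pointers. The children of
# the item at index i live at indices 2*i + 1 and 2*i + 2.
data = [9, 4, 7, 1, 8, 2]
heapq.heapify(data)          # rearranges in place; still exactly 6 slots

# The heap property: every parent is <= its children.
n = len(data)
for i in range(n):
    for child in (2 * i + 1, 2 * i + 2):
        if child < n:
            assert data[i] <= data[child]

print(data[0])  # the minimum, 1, sits at the root
```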

Succinct dictionaries[edit]

Succinct indexable dictionaries, also called rank/select dictionaries, form the basis of a number of succinct representation techniques, including binary trees, k-ary trees and multisets,[2] as well as suffix trees and suffix arrays.[3] The basic problem is to store a subset S of a universe U = {0, ..., n − 1}, usually represented as a bit array B where B[i] = 1 iff i ∈ S. An indexable dictionary supports the usual dictionary methods (queries, and insertions/deletions in the dynamic case) as well as the following operations:

rank_q(x) = |{k ∈ {0, ..., x} : B[k] = q}|
select_q(x) = min{k ∈ {0, ..., n − 1} : rank_q(k) = x}

for q ∈ {0, 1}.

In other words, rank_q(x) returns the number of elements equal to q up to position x, while select_q(x) returns the position of the x-th occurrence of q.
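As a reference point, both operations can be implemented naively in linear time over the bit array; the succinct structures below must return the same answers in constant time. A minimal Python sketch:

```python
# Naive reference implementation of rank and select on a bit array B
# (a list of 0s and 1s). Both take linear time; succinct dictionaries
# aim to match these answers in O(1) using only o(n) extra bits.

def rank(B, q, x):
    """Number of positions k in [0, x] with B[k] == q."""
    return sum(1 for k in range(x + 1) if B[k] == q)

def select(B, q, x):
    """Position of the x-th occurrence of q (the smallest k with rank == x)."""
    count = 0
    for k, bit in enumerate(B):
        if bit == q:
            count += 1
            if count == x:
                return k
    raise ValueError("fewer than x occurrences of q")

B = [1, 0, 1, 1, 0, 0, 1]
print(rank(B, 1, 3))    # 3 ones in B[0..3]
print(select(B, 1, 3))  # the third 1 is at position 3
```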

There is a simple representation[4] which uses n + o(n) bits of storage space (the original bit array plus an o(n) auxiliary structure) and supports rank and select in constant time. It uses an idea similar to that for range-minimum queries; there are a constant number of recursions before stopping at a subproblem of a limited size. The bit array is partitioned into large blocks of size l = lg² n bits and small blocks of size s = lg n / 2 bits. For each large block, the rank of its first bit is stored in a separate table R_l; each such entry takes lg n bits, for a total of O(n / lg n) bits of storage. Within a large block, another directory R_s stores the rank of each of the l / s = 2 lg n small blocks it contains. The difference here is that it only needs lg l = 2 lg lg n bits for each entry, since only the differences from the rank of the first bit in the containing large block need to be stored. Thus, this table takes a total of O(n lg lg n / lg n) bits. A lookup table R_p can then be used that stores the answer to every possible rank query on a bit string of length s, for every position within it; this requires O(2^s · s · lg s) = O(√n · lg n · lg lg n) bits of storage space. Thus, since each of these auxiliary tables takes o(n) space, this data structure supports rank queries in O(1) time and n + o(n) bits of space.

To answer a query for rank_1(x) in constant time, a constant time algorithm computes:

rank_1(x) = R_l[⌊x / l⌋] + R_s[⌊x / s⌋] + R_p[x mod s, b]

where b is the bit pattern of the small block containing position x.

In practice, the lookup table R_p can be replaced by bitwise operations and smaller tables that find the number of bits set in the small blocks. This is often beneficial, since succinct data structures find their uses in large data sets, in which case cache misses become much more frequent and the chances of the lookup table being evicted from closer CPU caches become higher.[5] Select queries can be easily supported by doing a binary search on the same auxiliary structure used for rank; however, this takes O(lg n) time in the worst case. A more complicated structure using only o(n) bits of additional storage can be used to support select in constant time.[6] In practice, many of these solutions have hidden constants in the O(·) notation which dominate before any asymptotic advantage becomes apparent; implementations using broadword operations and word-aligned blocks often perform better in practice.[7]

Entropy-compressed dictionaries[edit]

The n + o(n) space approach can be improved by noting that there are C(n, m) distinct m-subsets of U (or binary strings of length n with exactly m 1's), and thus B(m, n) = ⌈lg C(n, m)⌉ is an information-theoretic lower bound on the number of bits needed to store S. There is a succinct (static) dictionary which attains this bound, namely using B(m, n) + o(B(m, n)) space.[8] This structure can be extended to support rank and select queries and takes B(m, n) + O(n lg lg n / lg n) space.[2] Correct rank queries in this structure are however limited to elements contained in the set, analogous to how minimal perfect hashing functions work. This bound can be reduced to a space/time tradeoff by reducing the storage space of the dictionary to B(m, n) + O(n / (lg n / t)^t) with queries taking O(t) time.[9]
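For a sparse set, the gap between the plain bit array and the entropy bound B(m, n) is dramatic. A quick computation with assumed illustrative sizes (n = 1,000,000 positions, m = 1,000 elements):

```python
import math

n, m = 1_000_000, 1_000                      # illustrative universe and set size
plain = n                                    # bits used by the plain bit array
B = math.ceil(math.log2(math.comb(n, m)))    # entropy bound B(m, n)

print(plain, B)   # the entropy bound is orders of magnitude smaller
assert B < plain
```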

Examples[edit]

A null-terminated string (C string) takes Z + 1 space, and is thus implicit. A string with an arbitrary-length field (Pascal string) takes Z + log(Z) space, and is thus succinct. If there is a maximum length – which is the case in practice, since 2³² = 4 GiB of data is a very long string, and 2⁶⁴ = 16 EiB of data is larger than any string in practice – then a string with a fixed-width length field is also implicit, taking Z + k space, where k is the number of bits needed to represent the maximum length (e.g., 64 bits).
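These overheads can be made concrete in a few lines (space counted in bytes here rather than bits; the 64-bit length field is the assumed fixed-width case):

```python
import struct

payload = b"succinct"
Z = len(payload)

c_string = payload + b"\x00"               # Z + 1 bytes: implicit
pascal64 = struct.pack("<Q", Z) + payload  # Z + 8 bytes (64-bit length): implicit

print(len(c_string), len(pascal64))
```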

When a sequence of variable-length items (such as strings) needs to be encoded, there are various possibilities. A direct approach is to store a length and an item in each record – these can then be placed one after another. This allows efficient traversal to the next item, but not finding the kth item. An alternative is to place the items in order with a delimiter (e.g., null-terminated strings). This uses a delimiter instead of a length, and is substantially slower, since the entire sequence must be scanned for delimiters. Both of these are space-efficient. An alternative approach is out-of-band separation: the items can simply be placed one after another, with no delimiters. Item bounds can then be stored as a sequence of lengths, or better, offsets into this sequence. Alternatively, a separate binary string consisting of 1s in the positions where an item begins, and 0s everywhere else, is encoded along with it. Given this string, the select function can quickly determine where each item begins, given its index.[10] This is compact but not succinct, as it takes 2Z space, which is O(Z).
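The out-of-band scheme can be sketched as follows: items are concatenated with no delimiters, and a marker bit string carries a 1 wherever an item begins; select on the markers then locates the kth item directly (a linear-time select is used here for simplicity):

```python
# Out-of-band separation: a blob of concatenated items plus a marker
# bit string with 1s at item starts.

items = [b"red", b"green", b"blue"]
blob = b"".join(items)
marks = []
for it in items:
    marks += [1] + [0] * (len(it) - 1)

def select1(bits, k):
    """Position of the k-th 1 (1-indexed) in the marker bit string."""
    count = 0
    for pos, b in enumerate(bits):
        count += b
        if count == k:
            return pos
    raise IndexError("fewer than k ones")

def item(k):
    """Extract the k-th item (0-indexed) from the blob via the markers."""
    start = select1(marks, k + 1)
    end = select1(marks, k + 2) if k + 1 < len(items) else len(blob)
    return blob[start:end]

print(item(1))  # b'green'
```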

Another example is the representation of a binary tree: an arbitrary binary tree on n nodes can be represented in 2n + o(n) bits while supporting a variety of operations on any node, which includes finding its parent, its left and right child, and returning the size of its subtree, each in constant time. The number of different binary trees on n nodes is C(2n, n) / (n + 1) (the nth Catalan number). For large n, this is about 4^n; thus we need at least about lg(4^n) = 2n bits to encode it. A succinct binary tree therefore would occupy 2 bits per node.
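The 2 bits per node figure can be seen concretely in a simple shape encoding (this is only the space side of the story; it does not by itself give the constant-time navigation operations): write the tree in preorder, emitting 1 for a node and 0 for an empty subtree. A tree with n nodes has n + 1 empty subtrees, so the code is exactly 2n + 1 bits:

```python
# Preorder shape encoding of a binary tree: 1 = node, 0 = empty subtree.
# Trees are modeled as (left, right) tuples, with None for an empty subtree.

def encode(t, out):
    if t is None:
        out.append(0)
    else:
        out.append(1)
        encode(t[0], out)
        encode(t[1], out)
    return out

def decode(bits):
    """Inverse of encode: rebuild the tree by consuming bits in preorder."""
    it = iter(bits)
    def go():
        if next(it) == 0:
            return None
        return (go(), go())
    return go()

tree = ((None, None), (None, (None, None)))  # 4 nodes
bits = encode(tree, [])
print(len(bits))  # 2 * 4 + 1 = 9 bits
```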

References[edit]

  1. ^ Jacobson, G. J (1988). Succinct static data structures (Ph.D. thesis). Pittsburgh, PA: Carnegie Mellon University.
  2. ^ a b Raman, R.; V. Raman; S. S Rao (2002). "Succinct indexable dictionaries with applications to encoding k-ary trees and multisets". Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms. pp. 233–242. arXiv:0705.0552. CiteSeerX 10.1.1.246.3123. doi:10.1145/1290672.1290680. ISBN 0-89871-513-X.
  3. ^ Sadakane, K.; R. Grossi (2006). "Squeezing succinct data structures into entropy bounds" (PDF). Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm. pp. 1230–1239. ISBN 0-89871-605-5. Archived from the original (PDF) on 2011-09-29.
  4. ^ Jacobson, G. (1 November 1989). Space-efficient static trees and graphs (PDF). 30th IEEE Symposium on Foundations of Computer Science. doi:10.1109/SFCS.1989.63533. Archived from the original (PDF) on 2016-03-12.
  5. ^ González, R.; S. Grabowski; V. Mäkinen; G. Navarro (2005). "Practical implementation of rank and select queries" (PDF). Poster Proceedings Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA). pp. 27–38.
  6. ^ Clark, David (1996). Compact pat trees (PDF) (Ph.D. thesis). University of Waterloo.
  7. ^ Vigna, S. (2008). Broadword implementation of rank/select queries (PDF). Experimental Algorithms. Lecture Notes in Computer Science. 5038. pp. 154–168. CiteSeerX 10.1.1.649.8950. doi:10.1007/978-3-540-68552-4_12. ISBN 978-3-540-68548-7.
  8. ^ Brodnik, A.; J. I Munro (1999). "Membership in constant time and almost-minimum space" (PDF). SIAM J. Comput. 28 (5): 1627–1640. CiteSeerX 10.1.1.530.9223. doi:10.1137/S0097539795294165.
  9. ^ Pătraşcu, M. (2008). "Succincter" (PDF). Foundations of Computer Science, 2008. FOCS'08. IEEE 49th Annual IEEE Symposium on. pp. 305–313.
  10. ^ Belazzougui, Djamal. "Hash, displace, and compress" (PDF).