DNA Bloom Filter enables anti-contamination and file version control for DNA-based data storage

Abstract

DNA storage is one of the most promising approaches to future information storage owing to its high storage density, long storage lifetime and low maintenance cost. However, errors are inevitable during synthesis, storage and sequencing. Many error correction algorithms have been developed to ensure accurate information retrieval, but they decrease storage density or increase computational complexity. Here, we apply the Bloom Filter, a space-efficient probabilistic data structure, to DNA storage to achieve an anti-error, or anti-contamination, function. This method needs only the original correct DNA sequences (referred to as target sequences) to produce a corresponding data structure, which filters out almost all the incorrect sequences (referred to as non-target sequences) during sequencing data analysis. Experimental results demonstrate the universal and efficient filtering capabilities of our method. Furthermore, we employ the Counting Bloom Filter to achieve a file version control function, which significantly reduces synthesis costs when modifying DNA-form files. To make file version control cost-efficient, we developed a modified system based on the yin–yang codec.

DNA-based data storage, bloom filter, anti-contamination, file version control

INTRODUCTION

With the rapid growth of information volume, current storage mediums are anticipated to fall short in meeting the demands of data preservation in the future. In comparison with common storage mediums such as tapes, hard disks and flash drives, DNA molecules are becoming a promising storage medium due to their exceptional data storage density, long-term data retention and low maintenance costs. In order to promote the development of DNA-based data storage, early efforts have primarily concentrated on the construction of algorithms including constrained codes and error-correcting codes [1–10], the optimization of bioinformatics methods [11, 12], the development of software packages [13–15], the design of storage functions [16–20] and the integration of automatic input/output devices [21–23].

Among the principal objectives in constructing algorithms is the tolerance of errors that occur during the production, preservation and observation of DNA molecules containing digital data. The manifestation of errors in DNA molecules is exceptionally intricate [24], including nucleotide-scale errors such as insertion, deletion and substitution, as well as sequence-scale errors like loss and break. Early efforts introduced error-correcting codes [25–27] and optimized downstream technologies including sequence clustering [11, 28] and multiple sequence alignment [29], to resist these errors to some extent. Nevertheless, real-world applications are anticipated to be more complicated, potentially diminishing the effectiveness of these efforts. For example, the risk of cross-contamination in experiments and external environments poses a threat, leading to the possibility of obtaining a DNA sequence in a batch that does not genuinely belong to it, which is regarded as a non-target DNA sequence. This expanded definition of errors necessitates leveraging a broader range of technologies from the field of computer engineering to address this intensified challenge of error resistance, referred to as anti-contamination.

Compared with conventional storage techniques, DNA-based data storage exhibits a unique feature: the molecular product derived from each DNA sequence comprises multiple copies. Therefore, in addition to correcting errors in each retrieved DNA sequence, another potential strategy for resisting errors is to filter out incorrect DNA sequences, which obtains the target DNA sequences at low computational complexity. Filters are now extensively employed in diverse computer engineering contexts, including spam detection [30] and duplicate elimination of webpage addresses [31]. Wu provided a striking usage scenario for the most classic and pragmatic filter, the Bloom Filter (BF) [32]: a blacklist dataset comprises 10 billion blacklisted website addresses, each taking up a maximum of 64 bytes. Using no more than 30 gigabytes of storage space, the BF can determine whether a given website address is present in the blacklist with an error rate of less than one in 10 000 [33]. The BF has also been applied in bioinformatics [34], where it serves as a container that stores the characters of each sequence and is regarded as a vertex in a de Bruijn graph for storing a pan-genome. Hence, we have grounds to expect that the BF holds potential for application in DNA-based data storage.
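The blacklist figures quoted above can be checked with a short calculation. The sketch below assumes the standard sizing rule for an optimally tuned Bloom filter (bits per element = -ln p / (ln 2)^2); the variable names are illustrative and do not come from the cited sources.

```python
import math

def optimal_bits_per_element(false_positive_rate: float) -> float:
    """Bits needed per stored element by an optimally tuned Bloom filter."""
    return -math.log(false_positive_rate) / (math.log(2) ** 2)

n_addresses = 10_000_000_000   # 10 billion blacklisted addresses
raw_gb = n_addresses * 64 / 1e9            # storing every address verbatim (decimal GB)
target_rate = 1e-4                         # "one in 10 000"

bits_per_element = optimal_bits_per_element(target_rate)
filter_gb = n_addresses * bits_per_element / 8 / 1e9

print(f"raw list: {raw_gb:.0f} GB, Bloom filter: {filter_gb:.1f} GB")
```

Under these assumptions the filter needs roughly 24 GB against 640 GB for the raw list, which is consistent with the 30 gigabyte bound quoted in the scenario.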

Here, we employ numerical simulations, as in previous experiments [35, 36], to substantiate the feasibility of using filters to address the anti-contamination issue, and we further construct a file version control system to demonstrate the application of filters in dynamic storage functionality [17, 37]. We introduce the BF and relevant variants (detailed in Methods and Materials) to complete the corresponding feasibility verification. For convenience, we refer to this set of filters applied in DNA-based data storage as the DNA Bloom Filter (DNA-BF). DNA-BF achieves an accurate anti-contamination function even if the proportion of target sequences is only four thousandths or smaller. The anti-contamination function is robust and is not influenced by file type, file size, file format, the matching coding scheme or the filter parameter configuration. Furthermore, the file version control system based on the DNA-BF can modify files by re-synthesizing only the modified part of the files, which can greatly reduce the synthesis cost.

RESULTS

Overview of DNA-BF

The DNA-BF for the anti-contamination function uses a general BF, consisting of an array and several hash functions. It stores elements indirectly: each element is passed through several hash functions, and the array positions they map to are set to 1, with no special handling of positions where hash collisions occur. It is a fast, memory-efficient data structure for detecting whether an element exists in a set at the cost of some accuracy, resulting in a certain false positive rate (|$r_{fp}$|), as shown in Figure S2.
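A minimal sketch of such a general BF is given here for illustration. The hash family (SHA-256 salted with the hash index) and the class and parameter names are assumptions made for this example, not the implementation used in this study.

```python
import hashlib

class BloomFilter:
    """General Bloom filter: an array of 0/1 cells and k hash functions."""

    def __init__(self, length: int, num_hashes: int):
        self.length = length
        self.num_hashes = num_hashes
        self.array = bytearray(length)  # every cell starts at 0

    def _positions(self, sequence: str):
        # Assumed hash family: SHA-256 of the sequence salted with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{sequence}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.length

    def add(self, sequence: str) -> None:
        # Set every mapped cell to 1; collisions are not handled specially.
        for pos in self._positions(sequence):
            self.array[pos] = 1

    def __contains__(self, sequence: str) -> bool:
        # A sequence passes only if all of its cells are 1.
        return all(self.array[pos] for pos in self._positions(sequence))

targets = ["ACGTACGTAC", "TTGACCATGA", "GGCATCGATT"]
bf = BloomFilter(length=1024, num_hashes=3)
for seq in targets:
    bf.add(seq)

print(all(seq in bf for seq in targets))  # stored sequences always pass
```

A stored sequence can never be rejected, whereas an unseen sequence is accepted only when all of its cells happen to have been set by other sequences, which is the source of the false positive rate.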

The DNA-BF for the file version control function uses a variant of the general BF called the Counting Bloom Filter (CBF). It differs from the general BF in that, when elements are stored, each position of the array records the number of hash mappings to it, which lets it support deletion of elements. The deletion operation subtracts 1 from the values of the positions to which the hash functions map the element to be deleted, without affecting the detection of other elements (Figure S3).
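The counting variant can be sketched in the same way. As before, the SHA-256-based hash family and all names are assumptions made for illustration only.

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter: each cell counts how many hash mappings hit it."""

    def __init__(self, length: int, num_hashes: int):
        self.length = length
        self.num_hashes = num_hashes
        self.counts = [0] * length

    def _positions(self, sequence: str):
        # Assumed hash family: SHA-256 of the sequence salted with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{sequence}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.length

    def add(self, sequence: str) -> None:
        for pos in self._positions(sequence):
            self.counts[pos] += 1

    def remove(self, sequence: str) -> None:
        # Deletion subtracts 1 from every cell the sequence maps to.
        for pos in self._positions(sequence):
            self.counts[pos] -= 1

    def __contains__(self, sequence: str) -> bool:
        return all(self.counts[pos] > 0 for pos in self._positions(sequence))

cbf = CountingBloomFilter(length=1024, num_hashes=3)
stored = ["ACGTACGTAC", "TTGACCATGA", "GGCATCGATT"]
for seq in stored:
    cbf.add(seq)

cbf.remove("ACGTACGTAC")
print(all(seq in cbf for seq in stored[1:]))  # the other sequences are unaffected
```

Because each cell holds a count rather than a single bit, removing one sequence leaves the contributions of the remaining sequences intact.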

For clarity, the notation used in this paper is listed in Table 1.

Table 1

Notation used in this paper

Notation	Description
|$S$|	A DNA sequence randomly generated by a particular coding scheme.
|$\boldsymbol{S}^{\mathrm{i}}$|	A set of generated DNA sequences.
|$\boldsymbol{S}^{\mathrm{o}}$|	A set of sequences obtained from |$\boldsymbol{S}^{\mathrm{i}}$| via a noise channel.
|$\boldsymbol{S}^{\mathrm{r}}$|	A set of sequences obtained from |$\boldsymbol{S}^{\mathrm{o}}$| through a DNA BF |$F$| (see below).
|$\boldsymbol{S}^{\mathbb{T}}$|	A set of sequences identified as false positive sequences by a DNA BF.
|$n$|	The size of a DNA sequence set; in this study, |$n=|\boldsymbol{S}^{\mathrm{i}}|$|.
|$\boldsymbol{a}_{l}^{m}$|	An array of length |$l$| in which the maximum value of each cell is |$m$|, with |$m=1$| or |$m>n$|.
|$\boldsymbol{a}_{l}^{m}[i]$|	The |$i$|-th element of |$\boldsymbol{a}_{l}^{m}$|.
|$H$|	A hash function.
|$\boldsymbol{H}_{k}$|	A function group composed of |$k$| hash functions.
|$F$|	A DNA BF |$F=<n|\boldsymbol{a}_{l}^{m}|\boldsymbol{H}_{k}>$|. For a BF, |$m=1$|.
|$r_{fp}$|	Pre-set false positive rate of a DNA BF. |$r_{fp}=(1 - (1 - \frac{1}{l})^{nk})^{k}$|
|$r_{tfp}$|	Actual false positive rate of a DNA BF. In practice, |$r_{tfp}= \frac{\boldsymbol{S}^{\mathrm{r}} - \boldsymbol{S}^{\mathrm{i}}}{\boldsymbol{S}^{\mathrm{o}} - \boldsymbol{S}^{\mathrm{i}}}$|.
|$r_{tfn}$|	Actual false negative rate of a DNA BF. In practice, |$r_{tfn}= \frac{\boldsymbol{S}^{\mathbb{T}} - \boldsymbol{S}^{\mathrm{i}}}{\boldsymbol{S}^{\mathrm{r}} - \boldsymbol{S}^{\mathrm{i}}}$|.
|$\boldsymbol{M}^{\mathrm{nt}}$|	Maximum coverage number of non-target sequences.
|$\boldsymbol{M}^{\mathrm{t}}$|	Minimum coverage number of target sequences.

Anti-contamination by DNA-BF

The anti-contamination function applies two rounds of screening to sequencing data. The first round filters out most of the non-target sequences through the BF generated from the library of target sequences. In real cases, given the low error rates of synthesis and sequencing technologies [38] and the random occurrence of mutation or contamination, the molecular copy number of non-target sequences should be much smaller than that of target ones. Therefore, a significant difference is expected between the coverage numbers (the number of reads of each sequence after sequencing) of non-target and target sequences. The second round of screening takes advantage of this coverage difference to further filter out the false positive sequences. The complete process of the anti-contamination method is illustrated in Figure 1.
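The two rounds can be sketched as follows. The function name, the `passes_filter` callback and the `min_coverage` threshold are hypothetical; in particular, a plain Python set stands in for the BF here, with one extra sequence added to mimic a false positive.

```python
from collections import Counter
from typing import Callable, Iterable, Set

def two_round_screen(reads: Iterable[str],
                     passes_filter: Callable[[str], bool],
                     min_coverage: int) -> Set[str]:
    """Round 1: keep sequence types accepted by the filter.
    Round 2: keep those whose coverage reaches the threshold."""
    coverage = Counter(reads)  # reads per sequence type
    first_round = {s: c for s, c in coverage.items() if passes_filter(s)}
    return {s for s, c in first_round.items() if c >= min_coverage}

targets = {"ACGTAC", "TTGACC"}
# Stand-in for a Bloom filter: accepts the targets plus one false positive.
accepted = targets | {"ACGTAA"}

reads = ["ACGTAC"] * 20 + ["TTGACC"] * 22 + ["ACGTAA"] * 2 + ["GGGGGG"] * 3
kept = two_round_screen(reads, lambda s: s in accepted, min_coverage=10)
print(kept == targets)
```

In this toy input the first round rejects `GGGGGG`, and the second round rejects the low-coverage false positive `ACGTAA`, leaving only the two target sequences. The threshold of 10 is chosen only for this example.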

Figure 1

Illustration of filter-based error tolerance. Here we take the Bloom Filter as an example. The top part of the figure illustrates the fundamental process of DNA-based data storage, which includes the conversion of various file formats into DNA sequences, DNA synthesis, Polymerase Chain Reaction amplification and DNA sequencing. The lower part of the figure shows the error tolerance process we propose, where a Bloom Filter is generated from the DNA sequences prior to DNA synthesis. This filter is then utilized to identify and screen out incorrect or non-target DNA sequences that may be obtained during DNA sequencing.

Effectiveness of anti-contamination function

To verify the effectiveness of the BF, we conducted experiments using two DNA sequence libraries of identical size (same index length, data payload length and quantity): one consists of sequences encoded by the yin–yang codec (YYC) from a readable digital file (YYC library), and the other consists of purely randomly generated sequences (random library). The detailed sequence features of the random library are shown in Figure S4. To better match reality, the molecular copy number of sequences in the libraries follows a normal distribution, in line with high-throughput DNA synthesis technologies [39–42]. Before filtering by the BF, random errors based on real DNA synthesis and sequencing error rates were introduced into the library (see Methods) [36]. For clarity, only the number of sequence types is counted, which means that even if the simulated coverage of a sequence is large, it is counted as one sequence.
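A simplified sketch of this kind of simulation is shown below. The error model (a single per-base error rate split evenly among substitution, insertion and deletion), the copy-number parameters and all function names are assumptions for illustration; they are not the settings described in Methods.

```python
import random

BASES = "ACGT"

def mutate(sequence: str, error_rate: float, rng: random.Random) -> str:
    """Apply substitution, insertion or deletion to each base with a given probability."""
    out = []
    for base in sequence:
        if rng.random() >= error_rate:
            out.append(base)
            continue
        kind = rng.choice(("substitution", "insertion", "deletion"))
        if kind == "substitution":
            out.append(rng.choice([b for b in BASES if b != base]))
        elif kind == "insertion":
            out.append(base)
            out.append(rng.choice(BASES))
        # deletion: the base is simply dropped
    return "".join(out)

def simulate_reads(library, mean_copies, sd_copies, error_rate, rng):
    """Draw a normally distributed copy number per sequence, then add errors per copy."""
    reads = []
    for seq in library:
        copies = max(0, round(rng.gauss(mean_copies, sd_copies)))
        reads.extend(mutate(seq, error_rate, rng) for _ in range(copies))
    return reads

rng = random.Random(0)
library = ["".join(rng.choice(BASES) for _ in range(140)) for _ in range(5)]
reads = simulate_reads(library, mean_copies=30, sd_copies=3, error_rate=0.005, rng=rng)

sequence_types = set(reads)  # each distinct sequence is counted once
print(len(reads), len(sequence_types))
```

Counting `sequence_types` rather than `reads` mirrors the convention above that a sequence is counted once regardless of its simulated coverage.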

As shown in Figure 2A and Figure 2B, without any error correction algorithm or error sequence filtering strategy, the proportion of non-target sequences is much higher than that of the target sequences. In contrast, after filtering with the BF, all target sequences and a small number of non-target sequences are obtained, as shown in Figure 2C and Figure 2D, and the proportion of target sequences increases from a few tenths of a percent to over 75%. Meanwhile, as the coverage number of all the sequences increases, the proportion of non-target sequences increases rapidly. A likely reason is that, as the coverage number increases, more false positive sequences are detected according to the pre-set |$r_{fp}$| (Figure S4). Because all target sequences are already obtained at a coverage number of 20 or 22 (Figure S6) and their number then remains constant, the proportion of non-target sequences increases. These results demonstrate that the BF can effectively filter out most of the non-target sequences, and that its degree of anti-contamination is related to its own parameter settings and to the number of non-target sequence types in the simulation.

Figure 2

Effectiveness of the anti-contamination function for sequences encoded by YYC [7] compared with purely randomly generated sequences. The target sequences in (A), (C), (E) and (F) are encoded by YYC, whereas the target sequences in (B) and (D) are randomly generated. (A, B) Without anti-contamination processing. (C, D) Anti-contamination processing only with Bloom Filter. (E) The coverage of target and non-target sequences across different simulated sequencing depths. (F) Anti-contamination function, which combines the Bloom Filter with coverage-based filtration.

As mentioned earlier, a gap is expected between the coverage of target sequences and that of non-target sequences. Figure 2E illustrates that a significant difference in coverage between target and non-target sequences emerges when the sequencing depth reaches 30, and that it expands as the sequencing depth increases. Figure 2F demonstrates that the remaining non-target sequences can be further removed using this significant coverage difference. As the sequencing depth increases, the difference grows, which makes it easier to separate target sequences from non-target sequences.

The same experiment was also performed on sequences encoded by DNA Fountain [5] (DNA Fountain library), and the results show a similar anti-contamination ability (Figure S7) and robustness (Figure S8). This implies that the anti-contamination function of DNA-BF is not limited by the bit-to-base encoding method.

Robustness of anti-contamination function

To evaluate the robustness of the anti-contamination function and explore the factors that may affect the effectiveness of the BF, we conducted in silico experiments from four perspectives: pre-set |$r_{fp}$|, file size, file type and file format. The pre-set |$r_{fp}$| is an internal factor of the BF itself, which directly determines the proportion of false positive sequences. File size affects the total number of original target sequences. Different file types have different byte-frequency distributions [14]. Different file formats mean markedly different visible presentations within the same file type, as illustrated in Figure S9.

It is much easier for DNA-BF to differentiate target sequences if the coverage gap is large enough. As shown in Figure 3, we use the difference between the minimum coverage of target sequences (|$\boldsymbol{M}^{\mathrm{t}}$|) and the maximum coverage of non-target sequences (|$\boldsymbol{M}^{\mathrm{nt}}$|) to evaluate how well target sequences can be separated from non-target sequences. This evaluation metric remains relatively stable across the four perspectives mentioned above. This implies that the coverage difference can be used in conjunction with the BF to filter out all the non-target sequences, showing robust performance.
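The metric itself is simple to compute once the coverage of each sequence type is known. The following sketch uses hypothetical names and toy reads; a positive gap means that a single coverage threshold can separate the two groups.

```python
from collections import Counter

def coverage_gap(reads, targets):
    """Return M^t - M^nt: minimum target coverage minus maximum non-target coverage."""
    coverage = Counter(reads)
    target_set = set(targets)
    # A target that was never read has coverage 0.
    min_target = min(coverage[s] for s in target_set)
    max_non_target = max(
        (c for s, c in coverage.items() if s not in target_set), default=0
    )
    return min_target - max_non_target

reads = ["AAAA"] * 25 + ["CCCC"] * 22 + ["AAAT"] * 3 + ["CCCG"]
gap = coverage_gap(reads, targets=["AAAA", "CCCC"])
print(gap)  # 22 - 3 = 19
```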

Figure 3

Evaluation of robustness of the anti-contamination function. Four factors are considered for this assessment: (A) Pre-set false positive rate. (B) File size. (C) File type. (D) File format. Pure white, random-R and random-RGB are three different images in BMP format. Their visual representations can be found in Figure S9. Specifically, the original BMP file depicted in (C) is the ‘random-RGB’ image as indicated in (D).

File version control by DNA-BF

Since the efficiency and reliability of the anti-contamination feature were demonstrated in the previous section, it is practical to apply BF technology to more intricate functions within DNA storage, e.g. file version control. First, we developed a file version control system codec based on YYC. It generates the DNA sequence library of an updated version of a file by translating only the modified parts, while leaving other parts unchanged. Based on this codec, the complete DNA sequence library of each file version is obtained by combining the modified and unchanged parts. In addition, as shown in Figure 4A, these sequence libraries are used to generate their corresponding CBFs for record. During the storage process, sequences from all versions are stored together in one DNA pool. Then, as shown in Figure 4B, the corresponding CBF makes it feasible to retrieve all DNA sequences belonging to a specific version even if the library contains DNA sequences belonging to other versions. Using the deletion operation of the CBF, false positive sequences are further identified for accurate file recovery.

Figure 4

The complete process of the file version control function. (A) DNA-based storage process of different versions of files, and corresponding CBF generation of them. (B) Filtering sequenced data and recovering various file versions.

CBF in file version control system

In the file version control system, the DNA sequences of the requested version are regarded as the target sequences, while the sequences of other versions and error sequences are regarded as the non-target sequences. The number of sequence types of each of these three kinds in each version is given in Table S3. As shown in Figure 5A, the coverage of error sequences is significantly lower than that of target sequences, whereas the coverage of sequences from other versions lies within the coverage range of target sequences, which may cause trouble for file recovery if only a general BF is used. This is not surprising: every version of the DNA files has a similar sequence coverage for consistency, which gives some of the false positive sequences a high coverage number. According to probability theory, the overlap among the coverage of DNA sequences in different versions will intensify as the number of modified sequences or the number of versions increases. As shown in Table S1, after filtering by a general BF and the anti-contamination strategy, less than 5% of the target sequences can be obtained. This suggests that a general BF cannot be used for robust file version control.

Figure 5

The method for eliminating false positive information using a CBF in a file version control system. (A) The coverage distribution of detected DNA sequences based on CBF. (B) Comparison with the file version control system before and after the elimination of false positive information.

In contrast, the CBF improves the situation remarkably. As Table S2 shows, the percentage of obtained target sequences increases to more than 96% without further operation. However, 3–4% of the sequences obtained by the CBF still do not belong to the requested version, making it challenging to decode the corresponding information. Therefore, we propose a strategy to fully eliminate false positive sequences that relies on the deletion operation of the CBF, so that the target sequences of the requested version are obtained accurately (Figure 4B). Figure 5B presents the results before and after the deletion operation for each version, demonstrating 100% recovery of the requested version without obtaining non-target sequences.
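One way a deletion-based elimination could work is sketched below. This is a reading of the strategy under stated assumptions, not the procedure used in this study: every detected sequence is deleted from a working copy of the CBF, and a detected sequence is then treated as a false positive when all of its cells have become negative. The hash family, the decision rule and all names are hypothetical.

```python
import hashlib
import random

def positions(sequence, length, num_hashes):
    # Assumed hash family: SHA-256 of the sequence salted with the hash index.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{sequence}".encode()).digest()[:8], "big") % length
        for i in range(num_hashes)
    ]

def build_cbf(sequences, length, num_hashes):
    counts = [0] * length
    for seq in sequences:
        for pos in positions(seq, length, num_hashes):
            counts[pos] += 1
    return counts

def recover_version(pool, counts, num_hashes):
    length = len(counts)
    # Detection: keep sequence types whose cells are all positive.
    detected = {s for s in set(pool)
                if all(counts[p] > 0 for p in positions(s, length, num_hashes))}
    # Deletion: subtract every detected sequence from a working copy.
    residual = list(counts)
    for seq in detected:
        for p in positions(seq, length, num_hashes):
            residual[p] -= 1
    # Assumed rule: a sequence whose cells are all negative is a false positive.
    return {s for s in detected
            if not all(residual[p] < 0 for p in positions(s, length, num_hashes))}

rng = random.Random(0)
def new_seq():
    return "".join(rng.choice("ACGT") for _ in range(20))

version_1 = [new_seq() for _ in range(40)]
version_2 = version_1[:30] + [new_seq() for _ in range(10)]  # 10 modified sequences
pool = version_1 + version_2                                 # mixed DNA pool

cbf_v1 = build_cbf(version_1, length=1 << 16, num_hashes=3)
recovered = recover_version(pool, cbf_v1, num_hashes=3)
print(len(recovered))
```

Under this rule, cells touched only by target sequences return to zero after deletion, while cells also decremented by false positives become negative. A target sequence is lost only if all of its cells are shared with false positives, which becomes more likely as the number of false positives grows.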

Robustness evaluation of file version control system

As shown in the previous results, the method based on the CBF can eliminate all the non-target sequences. However, it carries the risk of removing target sequences (false negatives) as the number of non-target sequences increases, owing to a hash-collision-like effect [43, 44]. The false negative rate (|$r_{tfn}$|) is calculated as defined in Table 1. To improve data recovery accuracy and minimize |$r_{tfn}$|, we analyze the factors that affect |$r_{tfn}$| from both internal and external perspectives.

Array size (length of the array) and hash size (number of hash functions) are the two major factors that affect the |$r_{fp}$| of a CBF when the elements in the library remain unchanged. Increasing the array size decreases |$r_{fp}$|, whereas increasing the hash size first decreases |$r_{fp}$| and then increases it. As shown in Figure 6A, as |$r_{fp}$| or |$r_{tfp}$| increases, |$r_{tfn}$| increases rapidly and approaches 1 when |$r_{fp}$| or |$r_{tfp}$| is approximately 0.1. When |$r_{fp}$| is small, such as 0.001, and the number of elements is 100 000, no target sequences are eliminated by mistake.
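The dependence of the pre-set rate on array size and hash size follows directly from the formula for |$r_{fp}$| in Table 1 and can be explored numerically. In the sketch below, the array length is chosen with the standard optimal-size rule for a pre-set rate of 0.01; that sizing rule and the variable names are assumptions for illustration.

```python
import math

def preset_fp_rate(n: int, l: int, k: int) -> float:
    """r_fp = (1 - (1 - 1/l)^(n*k))^k, as defined in Table 1."""
    return (1 - (1 - 1 / l) ** (n * k)) ** k

n = 100_000                                             # number of stored elements
l = math.ceil(-n * math.log(0.01) / math.log(2) ** 2)   # array size tuned for r_fp = 0.01

rates = {k: preset_fp_rate(n, l, k) for k in range(1, 21)}
best_k = min(rates, key=rates.get)

print(f"array size l = {l}")
print(f"k = 1: {rates[1]:.4f}, k = {best_k}: {rates[best_k]:.4f}, k = 20: {rates[20]:.4f}")
print(f"doubling l at k = {best_k}: {preset_fp_rate(n, 2 * l, best_k):.6f}")
```

With few hash functions each lookup checks too few cells, and with many the array fills up, so the rate reaches its minimum at an intermediate hash size. Enlarging the array lowers the rate for any fixed hash size.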

Figure 6

Influencing factors of the false negative rate of eliminating false positive information based on CBF. (A) Internal factors. Internal factors include the array size and the hash size of the CBF. The values of the array size are selected by taking 10 more before and after the optimal size with interval 30 000 at 3 values of |$r_{fp}$|⁠, 0.001, 0.01 and 0.1, which contains |$21 \times 3$| values. The hash size ranges from 1 to 20, which contains |$20 \times 3$| values. (B) External factors. External factors include simulated sequencing depth and total error rate. There are 5 values of |$r_{fp}$|⁠, from 0.001 to 0.1. For each value of |$r_{fp}$|⁠, the simulated sequencing depth has 90 values ranging from 10 to 1000 with interval 10, and the total error rate has 19 values ranging from 0.001 to 0.01 with interval 0.0005.

In addition, |$r_{tfn}$| is also controlled by the number of non-target sequences in the environment, which indirectly affects the number of false positive sequences. Figure S10 shows that simulated sequencing depth and total error rate influence the number of non-target sequences in the environment in the same pattern, which implies that these two factors could have the same effect on |$r_{tfn}$|. Hence, we combine experimental data obtained from varying simulated sequencing depths and total error rates to investigate the correlation between |$r_{tfn}$| and the quantity of non-target sequences present in the environment. As shown in Figure 6B, |$r_{tfn}$| increases as the number of non-target sequences in the environment increases. For the same increase in the number of non-target sequences, |$r_{tfn}$| grows more slowly when |$r_{fp}$| is lower.

Both internal and external factors ultimately influence |$r_{tfn}$| by influencing the number of false positive sequences detected by the CBF. The number of false positive sequences affects the negative integers in the non-positive array. The larger the quantity or values of negative integers in the non-positive array, the greater the likelihood that target sequences will be erroneously identified as false positive sequences and subsequently eliminated. Therefore, to eliminate all the false positive sequences while avoiding the loss of target sequences, we suggest setting a low |$r_{fp}$| for the CBF or reducing the simulated sequencing depth or total error rate.

DISCUSSION

This study pioneers the application of filtering technology in DNA-based data storage, providing a practical and precise method that could potentially replace conventional error correction or anti-contamination processes. This approach significantly reduces the steps and time required for information retrieval. To address the complexities of various error types and the growing need for advanced storage functionalities, we have extended the conceptual framework of sequence classification in DNA-based data storage. The new category of ‘non-target DNA sequences’ includes both erroneous DNA sequences and other DNA sequences that are extraneous to the current batch. By employing the classical BF architecture, we have improved the differentiation between target and non-target sequences by leveraging their differential proportion in the whole library. This approach enables us to identify and eliminate non-target DNA sequences with pinpoint accuracy while preserving the expected ones. In addition, if further error correction or clustering is necessary in a specific data analysis, for example when some target sequences have been lost, the anti-contamination function based on DNA-BF can be used as a preprocessing tool that filters the sequencing data to reduce the computational complexity or memory consumption of those steps. Whether the coding and decoding method has its own error correction code or needs an additional error correction strategy, our anti-contamination function can improve the accuracy to some extent. Furthermore, by expanding the analysis of sequence properties, we have developed a highly robust file version control system. This system, based on a variant of the traditional BF data structure called the CBF, provides a cost-effective solution for information retention and editing, thus enhancing the reliability and versatility of DNA-based data storage.

The anti-contamination function based on the DNA-BF shows strong accuracy and places no restrictions on the coding scheme used for sequence generation. It is also robust, as the coverage difference is not affected by the parameter settings of the BF or the characteristics of the file. The file version control system based on the DNA-BF can accurately extract a specific version of a file from a mixed library of multiple versions under appropriate settings. Thus, this study provides new insights into DNA-based data storage research and demonstrates the integration of established information technology into the intricate processes of DNA-based data storage.

Further comprehensive investigation into the implementation of filtering technology in DNA-based data storage is still required. For example, the criteria for determining the coverage threshold remain unclear. Although the file version control system can eliminate non-target sequences by utilizing the CBF under normal conditions, the false negative rate may increase markedly once the amount of non-target sequences reaches a certain level. The memory consumption of the DNA-BF is lower than that of the original files; nevertheless, it still requires traditional storage media to a certain extent. In addition, loss of the DNA-BF incurs the risk of losing the actual data [45, 46]. One potential solution is to convert the BF itself into a separate DNA pool for storage. Moreover, to ensure the accuracy of the DNA-BF itself in case of degradation, contamination, sabotage, etc., it may be necessary to have more backups of DNA sequences, higher physical redundancy, or additional error-correcting codes.

METHODS AND MATERIALS

Sequencing data generation

Target sequences generation

YYC was used to convert binary sequences of actual files into DNA sequences. The payload length of each DNA sequence is 120 nt, and the index length is 20 nt.

For files of different versions, a file version control system codec based on YYC was used to transform binary sequences into DNA sequences. The payload length of each DNA sequence is 120 nt, the prime index length is 20 nt, and the minor index length is 14 nt with 4 nt reserved for marking. A detailed illustration of the sequence design can be seen in Figure S11.

For the anti-contamination function experiments, an article in PDF format is used as the default real source file, and 100 000 target sequences are randomly selected. In the file size experiment, a PDF file consisting of four articles is utilized. For the file type experiments, all sequences encoded from each file type are included. The original version of the file in the file version control system experiments contains 84 940 target sequences.

Synthesizing and sequencing simulation

In our experiment, the molecular copy number of each DNA sequence to be synthesized is set to 1000. We assume that these molecular copy numbers follow a normal distribution with a standard deviation of 100. Additionally, the default total error rate of each base on a synthesized DNA sequence is 0.3%, following Song et al. [12], consisting of a substitution rate of 0.15%, an insertion rate of 0.075% and a deletion rate of 0.075%.

After simulating the synthesis with error occurrence, all the DNA sequences will be randomly shuffled, and then a sequencing depth will be specified to obtain the final sequencing data.
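The simulation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, errors are assumed to be drawn independently for every base of every molecule, an inserted base is assumed to be placed after the current base, and "sequencing depth" is assumed to mean the number of reads per target sequence.

```python
import random

BASES = "ACGT"


def mutate(seq, rng, sub=0.0015, ins=0.00075, dele=0.00075):
    """Apply per-base substitution, insertion and deletion errors to one molecule."""
    out = []
    for base in seq:
        r = rng.random()
        if r < sub:
            # Substitution: replace with one of the three other bases.
            out.append(rng.choice([b for b in BASES if b != base]))
        elif r < sub + ins:
            # Insertion (assumed placement): keep the base, add a random one after it.
            out.append(base)
            out.append(rng.choice(BASES))
        elif r < sub + ins + dele:
            # Deletion: drop the base.
            continue
        else:
            out.append(base)
    return "".join(out)


def simulate_reads(targets, depth, rng, mean_copies=1000, sd_copies=100):
    """Synthesize noisy copies, shuffle the pool and draw depth * len(targets) reads."""
    pool = []
    for seq in targets:
        # Copy number ~ Normal(1000, 100), rounded and clipped at zero.
        copies = max(0, round(rng.gauss(mean_copies, sd_copies)))
        pool.extend(mutate(seq, rng) for _ in range(copies))
    rng.shuffle(pool)
    n_reads = min(len(pool), depth * len(targets))  # assumed meaning of "depth"
    return pool[:n_reads]


rng = random.Random(0)
targets = ["".join(rng.choice(BASES) for _ in range(140)) for _ in range(3)]
reads = simulate_reads(targets, depth=10, rng=rng)
```

With a total error rate of 0.3% and 140 nt per sequence, most simulated molecules remain error-free, which is what makes the coverage of target sequences much higher than that of any single erroneous sequence.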

Anti-contamination using DNA-BF

A BF stores information in an array in which some positions are marked as 1 and the others as 0. The initial BF is an array with every position marked as 0. When a DNA sequence is processed, it is transformed into one or several values. These values correspond to positions in the array, which are then marked as 1. The number of values generated for each DNA sequence depends on the hash size, which determines how many rounds of hash functions are performed.

For each target sequence, the input of each round of the hash function consists of the bases of the DNA sequence and the index of the current round. The round index is used as the initial value of the hash calculation. Each base of the DNA sequence is then transformed in turn into its ASCII value, which is combined with the value calculated from the previous base. The resulting value of each hash function gives the position in the BF array that is set to 1. Once these steps have been performed for every hash function, the target sequence is stored in the BF.

Once all the target sequences are stored in the BF, a complete BF is generated. To check whether a DNA sequence belongs to the target sequences, the hash values are calculated for the same number of rounds as during storage, and every position in the BF array to which a hash value maps is checked. If any of these positions is not set to 1, the DNA sequence does not belong to the target sequences.
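The storage and query steps above can be sketched as a small class. The sketch follows the description (round index as the initial value, ASCII value of each base folded into the running value), but the mixing step itself, a multiplication by 131 followed by an addition and a 64-bit mask, is an assumption, because the text does not specify it. The class and method names are hypothetical.

```python
class DNABloomFilter:
    """Minimal Bloom filter for DNA sequences (illustrative sketch)."""

    def __init__(self, length, num_hashes):
        self.length = length          # l: number of array positions
        self.num_hashes = num_hashes  # k: number of hash rounds
        self.bits = bytearray(length)  # one byte per position, for readability

    def positions(self, seq):
        """Return one array position per hash round."""
        result = []
        for round_index in range(self.num_hashes):
            value = round_index  # the round index is the initial value
            for base in seq:
                # Assumed mixing step: fold in the ASCII value of each base.
                value = (value * 131 + ord(base)) & 0xFFFFFFFFFFFFFFFF
            result.append(value % self.length)
        return result

    def add(self, seq):
        """Store a target sequence by setting its positions to 1."""
        for pos in self.positions(seq):
            self.bits[pos] = 1

    def __contains__(self, seq):
        """A sequence passes only if every mapped position is set to 1."""
        return all(self.bits[pos] for pos in self.positions(seq))


bf = DNABloomFilter(length=1000, num_hashes=3)
bf.add("ACGTACGTAC")
print("ACGTACGTAC" in bf)  # True: a stored sequence always passes
```

A stored sequence can never be rejected, whereas an unseen sequence may occasionally pass; these false positives are what the coverage threshold is meant to remove.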

Screening the sequencing data with this BF yields all the target sequences present in the sequencing data together with a small fraction of the non-target sequences. Since the non-target sequences are randomly generated, their coverage is much lower than that of the target sequences. These residual non-target sequences can therefore be removed easily by manually setting a coverage threshold.
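A minimal sketch of the coverage step is given below. The function name and the `passes_filter` callable are hypothetical; in practice the callable would be the BF membership test. Following the notation of Table 1, any threshold above the maximum coverage of non-target sequences and not above the minimum coverage of target sequences separates the two groups.

```python
from collections import Counter


def filter_by_coverage(reads, passes_filter, threshold):
    """Keep sequences that pass the filter and occur at least `threshold` times."""
    coverage = Counter(read for read in reads if passes_filter(read))
    return {seq for seq, count in coverage.items() if count >= threshold}


# Toy data: two targets with high coverage, one erroneous read that slips
# through the filter as a false positive, and one read that is rejected.
targets = {"AAAA", "CCCC"}
reads = ["AAAA"] * 5 + ["CCCC"] * 4 + ["AAAT", "GGGG"]
passes = lambda read: read in targets or read == "AAAT"  # stand-in for a BF query

print(sorted(filter_by_coverage(reads, passes, threshold=3)))  # ['AAAA', 'CCCC']
```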

Finally, |$r_{fp}$| is set manually; its formula is given in the |$r_{fp}$| row of Table 1. With |$n$| fixed, taking the given |$r_{fp}$| as the minimum value determines |$l$| and |$k$|, since this function is an increasing function. The default pre-set |$r_{fp}$| is 0.001.
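The text does not state how |$l$| and |$k$| are computed from |$n$| and |$r_{fp}$|. As an illustration only, the sketch below uses the standard textbook sizing of a Bloom filter, which is an assumption here, and then evaluates the |$r_{fp}$| formula from Table 1 for the resulting parameters. The function names are hypothetical.

```python
import math


def false_positive_rate(n, l, k):
    """r_fp = (1 - (1 - 1/l)^(n*k))^k, as given in Table 1."""
    return (1.0 - (1.0 - 1.0 / l) ** (n * k)) ** k


def size_bloom_filter(n, r_fp):
    """Standard sizing (assumed): l = -n ln(r_fp) / (ln 2)^2, k = (l / n) ln 2."""
    l = math.ceil(-n * math.log(r_fp) / (math.log(2) ** 2))
    k = max(1, round(l / n * math.log(2)))
    return l, k


# Default setting of the anti-contamination experiments:
# 100 000 target sequences and a pre-set r_fp of 0.001.
l, k = size_bloom_filter(n=100_000, r_fp=0.001)
print(l, k, false_positive_rate(100_000, l, k))  # k is 10 for these inputs
```

Under this sizing the array needs roughly 14.4 positions per target sequence, which is consistent with the statement that the DNA-BF consumes less memory than the original sequences.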

File version control system using DNA-BF

Different versions of the files are sourced from the scripts on different dates in the specified GitHub repository (https://github.com/iterative/dvc/tree/main/dvc). All the files under this repository on 1 June 2023 are integrated into a txt file, and this file is treated as the original version file (version1). Other versions of the files are sourced from 2 June 2023 to 15 June 2023.

For each version of the files, we generate a CBF in a way similar to the generation of a BF. The key difference is that the mark at each position in the array is cumulative, so the final CBF stores information in an array in which some positions hold positive integers and the others hold 0. The DNA sequences generated from all versions of the files can be stored in one pool. When a specific version of the files is required, the corresponding CBF is used to screen the sequencing data. Then, by applying the CBF-based method for eliminating false positive sequences, the false positive sequences from other versions can be effectively removed.
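The per-version CBF can be sketched as follows. The class and function names are hypothetical, the hash mixing step is the same assumed one as for the BF (the text does not specify it), and the toy sequences are not taken from the study.

```python
class CountingDNABloomFilter:
    """Counting Bloom filter: each position holds a count instead of a bit."""

    def __init__(self, length, num_hashes):
        self.length = length
        self.num_hashes = num_hashes
        self.counts = [0] * length

    def positions(self, seq):
        """One array position per hash round, seeded with the round index."""
        result = []
        for round_index in range(self.num_hashes):
            value = round_index
            for base in seq:
                # Assumed mixing step; the source does not specify it.
                value = (value * 131 + ord(base)) & 0xFFFFFFFFFFFFFFFF
            result.append(value % self.length)
        return result

    def add(self, seq):
        for pos in self.positions(seq):
            self.counts[pos] += 1  # cumulative mark

    def __contains__(self, seq):
        return all(self.counts[pos] > 0 for pos in self.positions(seq))


def build_version_filters(versions, length, num_hashes):
    """Build one CBF per file version from that version's target sequences."""
    filters = {}
    for name, sequences in versions.items():
        cbf = CountingDNABloomFilter(length, num_hashes)
        for seq in sequences:
            cbf.add(seq)
        filters[name] = cbf
    return filters


versions = {
    "version1": ["ACGTACGTAC", "TTGACCATGA"],
    "version2": ["ACGTACGTAC", "GGCATTAGCA"],  # shares one sequence with version1
}
filters = build_version_filters(versions, length=10_000, num_hashes=3)

# All versions share one pool; the CBF of the requested version selects candidates.
pool = sorted({seq for seqs in versions.values() for seq in seqs})
candidates = [seq for seq in pool if seq in filters["version2"]]
```

The candidate list always contains every sequence of the requested version; it may additionally contain sequences of other versions that pass as false positives, which is why the elimination step is needed.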

Because the DNA sequences from different file versions have similar coverage, false positive DNA sequences that belong to other versions are difficult to remove by coverage alone. When one version of the files is retrieved, sequences from other versions may therefore be mixed in. To solve this problem, we use a CBF instead of a BF and develop a method for eliminating false positive sequences, which ensures the accuracy of a specific version of the files. The method aims to eliminate all the false positive sequences, but it may introduce a non-zero |$r_{tfn}$|, meaning that some target sequences may also be eliminated. Fortunately, under suitable conditions, either no target sequences are mistakenly eliminated, or the number of eliminated target sequences is significantly lower than the number of false positive sequences that would be retained if only the BF were used.

Elimination of false positive information

Similar to the element deletion operation in the CBF, after the target sequences and false positive sequences are obtained through the CBF, the values at the array positions to which these sequences hash are each decreased by 1, which yields an array containing many zeros and several negative integers. Based on this non-positive array, the sequences whose hash-mapped positions in the array are all negative integers are then extracted and eliminated. In this way, most of the false positive sequences are eliminated. The detailed pipeline is shown in the follow-up pseudo code.
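The subtraction step can be illustrated with the sketch below. It is written from the prose description only and is not the authors' pseudo code; the function name and the `positions_of` callable are hypothetical, and the toy example uses hand-picked array positions instead of real hash values so that the arithmetic can be followed by hand.

```python
def eliminate_false_positives(candidates, counts, positions_of):
    """Return the candidates that survive the CBF subtraction step.

    candidates   -- unique sequences that passed the CBF query
    counts       -- CBF array built from the target sequences only
    positions_of -- callable mapping a sequence to its array positions
    """
    residual = list(counts)  # work on a copy, leave the CBF untouched
    for seq in candidates:
        for pos in positions_of(seq):
            residual[pos] -= 1
    # Targets cancel their own counts (residual 0); a false positive pushes
    # every one of its positions below zero and is therefore removed.
    return [
        seq for seq in candidates
        if not all(residual[pos] < 0 for pos in positions_of(seq))
    ]


# Toy example: two targets and one false positive overlapping both of them.
toy_positions = {"target_a": (0, 1), "target_b": (2, 3), "false_pos": (1, 2)}
cbf_counts = [1, 1, 1, 1]  # CBF built from target_a and target_b
kept = eliminate_false_positives(
    ["target_a", "target_b", "false_pos"], cbf_counts, toy_positions.__getitem__
)
print(kept)  # ['target_a', 'target_b']
```

In this sketch a target sequence is lost only when every one of its positions is also hit by false positives, which illustrates why the false negative rate can rise when the amount of non-target sequences becomes large.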

Implementation of file version control system

In YYC, binary sequences are paired randomly, so any modification to the input file results in a complete alteration of the encoded DNA sequences for that file. This random pairing may deviate from our original intention in creating the file version control system, in which only the modified parts should require new DNA sequences to be added to the pool, thereby reducing the cost of synthesis. In addition, there is currently no strategy for allocating indexes to the modified DNA sequences. We therefore adjusted YYC to achieve the file version control function in DNA storage.

We introduce adjacent pairing of binary sequences and define a prime index and minor indexes so that the modified sequences can be allocated correct indexes. The first index is called the prime index and the remaining indexes are called minor indexes. For example, when a sequence is inserted between two existing sequences that carry only the prime indexes 20 and 21, respectively, the inserted sequence is allocated the index 20-1, in which 20 is the prime index and 1 is the minor index. The length of the prime index is 20 nt and the length of the minor index is 14 nt with 4 nt reserved for marking. The payload length is 120 nt if a DNA sequence contains only a prime index. To preserve the integrity of the encoding of the DNA sequences before and after any modification, the numbers of binary sequences within the modified part, preceding it and succeeding it must all be even.

If the payload length of the last binary sequence exceeds the size of the actual binary stream, additional zeros must be placed in front of the binary stream to fill the final binary sequence. To distinguish these zero-filled binary sequences from normal binary sequences, 15 zeros are added after the indexes of a zero-filled binary sequence, and the next 7 bits record the length of the binary stream. The binary stream itself is placed at the end of the binary sequence. If the remaining bits are not enough to store the binary stream, new binary sequences are generated until the whole binary stream has been allocated and the number of binary sequences within this part is even. Therefore, more than 22 bits should be reserved in a binary sequence to store the binary stream, and the number of minor indexes should not exceed 6.
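The length bookkeeping above can be checked with a short sketch. Several points are assumptions rather than statements of the source: every sequence is taken to have a fixed total length of 140 positions (20 for the prime index plus 120 for the payload), each minor index is taken to occupy 14 positions in total (including the 4 reserved for marking) at the expense of the payload, and one bit per position is assumed for each binary sequence. The names are hypothetical.

```python
PRIME_INDEX_LEN = 20
MINOR_INDEX_LEN = 14      # assumed to include the 4 positions used for marking
TOTAL_LEN = 140           # assumed fixed length: 20 (prime index) + 120 (payload)
MARKER_ZEROS = 15         # zeros that flag a zero-filled binary sequence
LENGTH_FIELD_BITS = 7     # records the length of the binary stream


def payload_capacity(num_minor_indexes):
    """Positions left for the payload after the prime and minor indexes."""
    return TOTAL_LEN - PRIME_INDEX_LEN - MINOR_INDEX_LEN * num_minor_indexes


def pad_payload(stream_bits, payload_len):
    """Lay out a short stream as [15 zeros][7-bit length][zero fill][stream]."""
    room = payload_len - MARKER_ZEROS - LENGTH_FIELD_BITS
    if not 0 < len(stream_bits) <= room:
        # A longer stream would have to continue in a new binary sequence.
        raise ValueError("stream does not fit into this binary sequence")
    length_field = format(len(stream_bits), "07b")
    fill = "0" * (room - len(stream_bits))
    return "0" * MARKER_ZEROS + length_field + fill + stream_bits


for minors in (0, 6, 7):
    print(minors, payload_capacity(minors))  # 0 120, 6 36, 7 22

padded = pad_payload("1011", payload_capacity(0))
```

Under these assumptions, six minor indexes leave 36 positions, which is more than the 22 taken by the marker and the length field, whereas seven leave exactly 22 and no room for data; this matches the stated limit of 6 minor indexes.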

Our system currently only supports encoding txt files. In addition to the normal parameters in YYC, such as coding rule and support base, our system also requires modification operations, modification files, match files for precise positioning operations and DNA sequence files of the last version.

Key Points

We apply the Bloom Filter, a space-efficient probabilistic data structure, to DNA-based data storage and achieve anti-contamination DNA reading by exploiting the significant coverage difference between target sequences and non-target sequences.
File version control in DNA-based data storage is realized using a variant of the Bloom Filter, the Counting Bloom Filter. This function greatly reduces the cost of synthesis, as only the modified parts need to be re-synthesized.
A method that can eliminate false positive information based on Counting Bloom Filter is proposed.
A file version control system codec based on YYC is developed to only recode the modified parts of each file version.

FUNDING

This work was supported by the National Key Research and Development Program to Z.P. (no. 2020YFA0712100), the National Natural Science Foundation of China to Z.P. (no. 32101182), the Shenzhen Science, Technology and Innovation Commission grant no. SGDX20220530110802015 to Z.P. and Tip-top Scientific and Technical Innovative Youth Talents of Guangdong Special Support Program to Y.S. (no. 2019TQ05Y876).

AUTHOR CONTRIBUTIONS

Y.L., H.Z. and Z.P. proposed the concepts and designed the experiments; H.Z. and Y.L. completed the codes; Y.C. conducted the experiments and deployed the codes; Y.L. analyzed the results; Y.L. drafted the manuscript; H.Z., Y.C., Y.S. and Z.P. revised the manuscript; Z.P. supervised the study.

DATA AVAILABILITY

The real source files for encoding are in https://github.com/BGI-SynBio/DNA-BF/tree/main/files. The source code for realization of the functions is in https://github.com/BGI-SynBio/DNA-BF. The source code for the file version control system codec based on YYC is in https://github.com/BGI-SynBio/YYC-FileVersionControl.

Author Biographies

Yiming Li received her Master's degree in genetics from Central South University in 2022. She is currently a research assistant in BGI. Her research interests include bioinformatics and algorithms of DNA-based data storage.

Haoling Zhang received his BEng degree in software engineering from Chongqing University of Technology in 2018. Following this, he held the position of Senior Algorithm Engineer at BGI Research until early 2024. He is currently a MS/PhD student at King Abdullah University of Science and Technology. His research interests include DNA-based data storage, robust machine learning, and bioinformatics.

Yuxin Chen received his BSc degree in biological technology from South China University of Technology in 2016. He is currently an algorithm engineer in Beijing Genomics Institute. His research interests include DNA storage, data compression and bioinformatics.

Yue Shen received her PhD degree in molecular biology from the University of Edinburgh. She currently serves as the Chief Scientist of Synthetic Biology in BGI-Research. She is one of the key members in the synthetic yeast consortium (Sc2.0). Her research focuses on the development of DNA synthesis technologies and instruments, synthetic genomics and its downstream applications, and DNA-based data storage.

Zhi Ping received his PhD degree from Nanyang Technological University, Singapore. He is currently a faculty member in the School of Medicine, The Chinese University of Hong Kong, Shenzhen, and a Chief Scientist in BGI-Research. His research interests lie in DNA-based data storage, DNA synthesis, bioinformatics algorithms and structural biology.

References

1. Church, Gao, Kosuri. Next-generation digital information storage in DNA. Science 2012;337(6102):1628.

2. Goldman, Bertone, Chen, et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 2013;494(7435).

3. Grass, Heckel, Puddu, et al. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angew Chem Int Ed 2015:2552.

4. Blawat, Gaedke, Huetter, et al. Forward error correction for DNA data storage. Procedia Comput Sci 2016:1011.

5. Erlich, Zielinski. DNA fountain enables a robust and efficient storage architecture. Science 2017;355(6328):950.

6. Press, Hawkins, Jones, Jr, et al. Hedges error-correcting code for DNA storage corrects indels and allows sequence constraints. Proc Natl Acad Sci 2020;117:18489.

7. Ping, Chen, Zhou, et al. Towards practical and robust DNA-based data archiving using the yin–yang codec system. Nat Comput Sci 2022:234.

8. Löchel, Welzel, Hattab, et al. Fractal construction of constrained code words for DNA storage systems. Nucleic Acids Res 2022:e30.

9. Rasool, Hong, Jiang, et al. BO-DNA: biologically optimized encoding model for a highly-reliable DNA data storage. Comput Biol Med 2023;165:107404.

10. Zhang, Lan, Zhang, et al. Spider-web generates coding algorithms with superior error tolerance and real-time information retrieval capacity. 2022. arXiv preprint arXiv:2204.02855.

11. Guanjin, Yan, Huaming. Clover: tree structure-based efficient DNA clustering for DNA-based data storage. Brief Bioinform 2022:bbac336.

12. Song, Geng, Gong Z-Y, et al. Robust data storage in DNA by de Bruijn graph-based de novo strand assembly. Nat Commun 2022:5361.

13. Schwarz, Welzel, Kabdullayeva, et al. Mesa: automated assessment of synthetic DNA fragments and simulation of DNA synthesis, storage, sequencing and PCR errors. Bioinformatics 2020:3322.

14. Zhi, Zhang, Chen, et al. Chamaeleo: an integrated evaluation platform for DNA storage. Synth Biol J 2021:412.

15. Yuan, Xie, Wang. Desp: a systematic DNA storage error simulation pipeline. BMC Bioinformatics 2022.

16. Organick, Ang, Chen Y-J, et al. Random access in large-scale DNA data storage. Nat Biotechnol 2018:242.

17. Lin, Volkel, Tuck, Keung. Dynamic and scalable DNA-based information storage. Nat Commun 2020;(1):2981.

18. Banal, Shepherd, Berleant, et al. Random access DNA memory using Boolean search in an archival file storage system. Nat Mater 2021:1272.

19. Tomek, Volkel, Indermaur, et al. Promiscuous molecules for smarter file operations in DNA-based data storage. Nat Commun 2021:3518.

20. Bee, Chen Y-J, Queen, et al. Molecular-level similarity search brings computing to DNA data storage. Nat Commun 2021:4764.

21. Takahashi, Nguyen, Strauss, Ceze. Demonstration of end-to-end automation of DNA data storage. Sci Rep 2019:4998.

22. Chengtao, Gao, et al. Electrochemical DNA synthesis and sequencing on a single electrode with scalability for integrated data storage. Sci Adv 2021:eabk0100.

23. Lim, Yeoh, Kunartama, et al. A biological camera that captures and stores images directly into DNA. Nat Commun 2023;(1):3921.

24. Chengtao, Zhao, Liu. Uncertainties in synthetic DNA-based data storage. Nucleic Acids Res 2021:5451.

25. Reed, Solomon. Polynomial codes over certain finite fields. J Soc Ind Appl Math 1960:300.

26. Gallager. Low-density parity-check codes. IRE Trans Inf Theory 1962.

27. Luby. LT codes. In: IEEE (ed). The 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002. Proceedings. IEEE Computer Society, Los Alamitos, California, 2002, p 271.

28. Rashtchian, Makarychev, Racz, et al. Clustering billions of reads for DNA data storage. Adv Neural Inf Process Syst 2017:3362–73.

29. Xie, Zan, Chu, et al. Study of the error correction capability of multiple sequence alignment algorithm (MAFFT) in DNA storage. BMC Bioinformatics 2023.

30. Jindal, Liu. Review spam detection. In: Williamson CL, Zurko ME, Patel-Schneider PF, Shenoy PJ (eds). Proceedings of the 16th International Conference on World Wide Web. ACM, New York, USA, 2007, pp 1189–1190.

31. Kim, Song, Choi B-Y, et al. Existing deduplication techniques. Data Deduplication for Data Optimization for Storage and Network Systems. Springer International Publishing, New York, USA, 2017, pp 23–76.

32. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun ACM 1970:422.

33. Jun. The beauty of mathematics in computer science. CRC Press, New York, United States, Dieter Riebesehl (Lüneburg), zbMath, 2018.

34. Holley, Wittler, Stoye. Bloom filter trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms Mol Biol 2016.

35. Chen, Huaming. Multiple errors correction for position-limited DNA sequences with GC balance and no homopolymer for DNA-based data storage. Brief Bioinform 2023:bbac484.

36. Park S-J, Kim, Jeong, et al. Reducing cost in DNA-based data storage by sequence analysis-aided soft information decoding of variable-length reads. Bioinformatics 2023;(9):btad548.

37. Adams, Storer, Miller. Analysis of workload behavior in scientific and historical long-term data repositories. ACM Trans Storage 2012.

38. Kosuri, Church. Large-scale de novo DNA synthesis: technologies and applications. Nat Methods 2014:499–507.

39. Chen Y-J, Takahashi, Organick, et al. Quantifying molecular bias in DNA data storage. Nat Commun 2020;(1):3264.

40. Nguyen, Takahashi, Gupta, et al. Scaling DNA data storage with nanoscale electrode wells. Sci Adv 2021;(48):eabi6714.

41. Kekić, Lietard. A canvas of spatially arranged DNA strands that can produce 24-bit color depth. J Am Chem Soc 2023;145:22293.

42. Hoose, Vellacott, Storch, et al. DNA synthesis technologies to close the gene writing gap. Nat Rev Chem 2023:144.

43. Bender, Farach-Colton, Johnson, et al. Don’t thrash: how to cache your hash on flash. In: Ahmad I (ed). 3rd Workshop on Hot Topics in Storage and File Systems (HotStorage 11), USENIX Association, Berkeley, California, 2011.

44. Cleary. Compact hash tables using bidirectional linear probing. IEEE Trans Comput 1984;C-33:828.

45. Gervasio JHDB, da Costa Oliveira, da Costa Martins, et al. How close are we to storing data in DNA? Trends Biotechnol 2023;(2):156–67.

46. Jiashu, Dai, et al. A self-contained and self-explanatory DNA storage system. Sci Rep 2021:18063.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]


Notation	Description
|$S$|	A DNA sequence randomly generated by a particular coding scheme.
|$\boldsymbol{S}^{\mathrm{i}}$|	A set of generated DNA sequences.
|$\boldsymbol{S}^{\mathrm{o}}$|	A set of sequences obtained from |$\boldsymbol{S}^{\mathrm{i}}$| via a noise channel.
|$\boldsymbol{S}^{\mathrm{r}}$|	A set of sequences obtained from |$\boldsymbol{S}^{\mathrm{o}}$| through a DNA BF |$F$| (see below).
|$\boldsymbol{S}^{\mathbb{T}}$|	A set of sequences identified as false positive sequences by a DNA BF.
|$n$|	The size of a DNA sequence set; in this study, |$n=|\boldsymbol{S}^{\mathrm{i}}|$|.
|$\boldsymbol{a}_{l}^{m}$|	An array of length |$l$| in which the maximum value of each cell is |$m$|, with |$m=1$| or |$m>n$|.
|$\boldsymbol{a}_{l}^{m}[i]$|	The |$i$|-th element of |$\boldsymbol{a}_{l}^{m}$|.
|$H$|	A hash function.
|$\boldsymbol{H}_{k}$|	A function group composed of |$k$| hash functions.
|$F$|	A DNA BF |$F=<n|\boldsymbol{a}_{l}^{m}|\boldsymbol{H}_{k}>$|. For a BF, |$m=1$|.
|$r_{fp}$|	Pre-set false positive rate of a DNA BF. |$r_{fp}=(1 - (1 - \frac{1}{l})^{nk})^{k}$|
|$r_{tfp}$|	Actual false positive rate of a DNA BF. In practice, |$r_{tfp}= \frac{\boldsymbol{S}^{\mathrm{r}} - \boldsymbol{S}^{\mathrm{i}}}{\boldsymbol{S}^{\mathrm{o}} - \boldsymbol{S}^{\mathrm{i}}}$|.
|$r_{tfn}$|	Actual false negative rate of a DNA BF. In practice, |$r_{tfn}= \frac{\boldsymbol{S}^{\mathbb{T}} - \boldsymbol{S}^{\mathrm{i}}}{\boldsymbol{S}^{\mathrm{r}} - \boldsymbol{S}^{\mathrm{i}}}$|.
|$\boldsymbol{M}^{\mathrm{nt}}$|	Maximum coverage number of non-target sequences.
|$\boldsymbol{M}^{\mathrm{t}}$|	Minimum coverage number of target sequences.

