Still, I have thought of many fun things I could do with statistical learning: image classification and face recognition (I always want a photo management application to be like that! I take and collect so many photos each day), a CBIR tool (like “Search by Image” in Google Images, but works locally on my computer), a GIMP Script-Fu that does tricks like image synthesis via ConvNets, etc. The good part about applying learning methods in image processing is that, once you can extract feature descriptors, everything becomes statistically learnable (and you as a programmer don’t really need to have prior knowledge about what an object visually is: an apple, or a pen?).

Cryptography, from what I have learned, is a differently fun story. A perfectly secret encryption scheme satisfies \[\Pr[M=m\,|\,C=c] = \Pr[M=m]\] that is, knowing the ciphertext \(c\) will not update any attacker’s belief about what the underlying message \(m\) is. Even a statistical learning model cannot learn anything from observations of ciphertexts alone. But this unbreakable level of security comes at a price: the key space must have entropy at least as high as that of the message space, thus the key can be no shorter than the message (by Shannon’s theorem). In practice, the messages we send do not usually have the highest entropy possible, and we can safely assume that the attackers’ computational ability is bounded by polynomial-time algorithms; thus, we as cryptosystem designers need only devise schemes that are assumed to be unbreakable (i.e., breakable with only a negligible probability) by any polynomial-time attacker. As we do not yet know whether problems that are truly unsolvable in polynomial time exist (e.g., is P ≠ NP?), any proof of security ultimately relies on some unproven computational hardness assumption: one-way functions exist, integer factorization is hard, discrete logarithm is hard, etc. If one can construct a provably secure scheme, statistical cryptanalysis is guaranteed to be theoretically infeasible in polynomial time (side-channel attacks aside); of course, if the hardness assumption we made is proved invalid, then nothing but the one-time pad can be secure.

I might write one or two blog posts about cryptographic security from an information-theoretic perspective and some basic cryptanalysis of insecure schemes, but now it is time to move on with my unfinished courses on logic and programming. Before that, I feel that I should add a little bit of philosophy to my wiki so as to refresh my viewpoint and methodology. And here it is.

(Philosophy is a sophisticated, highly arguable subject, so pardon me if there’s any inconsistency with your textbook.)

- **Metaphysics** is about the fundamental nature of being and the world.
  - **Ontology**, as a branch of metaphysics, studies the basic categories of being of entities and their relations.
- Epistemology is the study of knowledge.
  - Where does knowledge come from?
    - **Empiricism** claims that knowledge comes from empirical evidence.
    - **Rationalism** claims that knowledge requires justification by reasoning.
    - **Skepticism** rejects certainty in knowledge and claims that it is impossible to have an adequate justification of knowledge.
  - How to resolve the *regress problem*, that the justification of knowledge is questioned ad infinitum?
    - In **foundationalism**, a statement is inferred from a basis of unprovably sound premises.
      - This approach leads to the (axiomatic) foundations of mathematics and constructivism.
      - The fundamentals of modern *mathematics* are the *axiomatic set theory* (which has several variations with different sets of axioms).
      - The fundamentals of modern *probability theory* are the *Kolmogorov axioms*.
      - The computational hardness assumptions in cryptography may also be seen as a practice of foundationalism.
    - In **coherentism**, a statement holds true as long as it coheres with all other justified beliefs, including itself.
    - In **infinitism**, a justification chain is allowed to be infinite, thus there can never be an adequate justification for any statement in the chain (the point for which it is often criticized).
  - So, what is knowledge, actually?
    - **Knowledge** is *justified true belief* (though questioned by the *Gettier problem* [1]).
  - How do we categorize our knowledge?
    - **A priori** knowledge is independent of empirical evidence. Examples: knowledge deduced by logical deduction or mathematical proof.
    - **A posteriori** knowledge is dependent on empirical evidence. Examples: knowledge induced by statistical inference or learning (either human or machine learning).
- **Science** is the approach of filtering out unreliable knowledge and gathering reliable knowledge together as a **theory**.
  - What qualifies as a science?
    - See the *demarcation problem*.
  - Scientific discovery is made of **hypotheses**. A hypothesis is a proposed explanation or prediction method for a phenomenon.
    - Only formally proved or statistically tested hypotheses qualify as reliable knowledge.
    - **Epicurus’ principle of multiple explanations**: if multiple hypotheses are consistent with the observations, one should retain them all.
    - **Occam’s razor**: among all hypotheses consistent with the observations, choose the simplest.
    - Epicurus’ principle of multiple explanations and Occam’s razor both find uses in learning theory, e.g., the Occam’s razor bound.
- Logic is the study of reasoning and argument, which plays an essential part in gaining knowledge.
  - How to reason / argue by inference?
    - **Deduction** is to infer a conclusion by deriving a logical consequence (“a priori”) from some premises, using rules of inference in a formal system. Examples: proofs, a priori probabilities.
    - **Induction** is to infer a probable conclusion (“a posteriori”) from the generalization of some observations. Examples: statistical inference and learning, inductive programming.
    - **Abduction** is to infer a probable explanation from some observations. Examples: deep learning.

With an understanding of epistemology and logic, we can talk about the philosophy of science (particularly, the philosophy of mathematics and statistics): Are you a logician or a statistician? If a logician, does set theory or type theory suit you best? If a statistician, are you a Bayesian or a frequentist?

(As a personally opinionated note, I often find myself subscribing to skepticism the most. But that doesn’t mean that logical reasoning and statistical inference aren’t useful to me; they are. Extremely. So as not to go too far with this subjective topic, I’ll be focusing more on classical / modal logic in the next few weeks.)

**Papers:**

[1] E. L. Gettier, “Is justified true belief knowledge?” *Analysis*, vol. 23, no. 6, pp. 121–123, 1963.

- What is the measure for **information**?
  - Intuitively, if a sample appears to be more naturally “random”, then it may contain more “information” of interest, since it takes a greater size of data (more bits) to describe. But how to measure this quantitatively?
    - Probability-theoretic view: *Shannon entropy*.
    - Algorithmic view: *Kolmogorov complexity*. (TBD in February or March 2017)
- Basic information theory
  - **Shannon entropy**
    - For a discrete random variable \(X\) with pmf \(p(x)\): \[\operatorname{H}(X) = -\sum_{x\in\mathcal{X}} p(x) \log p(x).\]
    - For a continuous random variable \(X\) with pdf \(f(x)\): \[\operatorname{H}(X) = -\int_\mathcal{X} f(x) \log f(x) dx.\] (also referred to as *differential entropy*)
    - The notion of entropy is an extension of the one in statistical thermodynamics (Gibbs entropy) and the measure-theoretic entropy of dynamical systems.
    - The entropy is determined by the pmf/pdf, which in turn depends on the parameters of the specific probability distribution.
    - In the context of computer science, the logarithm in the formula is often taken to base \(2\). Take a uniform binary string of length \(\ell\); then \[p(x) = 2^{-\ell}\] and the entropy of the distribution is \[\operatorname{H}(X) = -\sum_{x\in\mathcal{X}} p(x) \log p(x) = - (2^\ell \cdot 2^{-\ell} \log_2 2^{-\ell}) = \ell\] which is just the length of this (\(\ell\)-bit) binary string. Therefore, the unit of information (under the binary logarithm) is often called a *bit* (also *shannon*).
  - For the **joint entropy** \(\operatorname{H}(X,Y)\) and the **conditional entropy** \(\operatorname{H}(X\,|\,Y)\), the following equation holds: \[\operatorname{H}(X,Y) = \operatorname{H}(X\,|\,Y) + \operatorname{H}(Y) = \operatorname{H}(Y\,|\,X) + \operatorname{H}(X)\] Notice that if \(\operatorname{H}(X\,|\,Y) = \operatorname{H}(X)\), then \(\operatorname{H}(X,Y) = \operatorname{H}(X) + \operatorname{H}(Y)\) and \(\operatorname{H}(Y\,|\,X) = \operatorname{H}(Y)\); \(X\) and \(Y\) are said to be independent of each other in this case.
  - **Mutual information**: \(\operatorname{I}(X;Y) = \operatorname{H}(X) + \operatorname{H}(Y) - \operatorname{H}(X,Y) \geq 0\) (equality holds iff \(X\) and \(Y\) are independent). Unlike the joint entropy or the conditional entropy, this notion does not reflect an actual probabilistic event, thus it is referred to as *information* (sometimes *correlation*) rather than entropy.
  - **Kullback-Leibler divergence (relative entropy)**: \[\operatorname{D}_\mathrm{KL}(p\|q) = \sum_{x\in\mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\] The KL divergence is a measure of the distance between two probability distributions \(p(x)\) and \(q(x)\).
    - If \(p = q\), then \(\operatorname{D}_\mathrm{KL}(p\|q) = 0\). (Any distribution has a KL divergence of 0 with itself.)
    - \(\operatorname{I}(X;Y) = \operatorname{D}_\mathrm{KL}(p(x,y)\|p(x)p(y))\).
  - **Cross entropy**: \[\operatorname{H}(p,q) = \operatorname{H}(p) + \operatorname{D}_\mathrm{KL}(p\|q)\] Notice that cross entropy is defined on two distributions \(p\) and \(q\), rather than on two random variables under one distribution \(p\) (as in the joint entropy \(\operatorname{H}(X,Y)\)).
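These quantities are easy to check numerically. Below is a minimal sketch in plain Python (the function names are mine, not from any particular library), computing entropy, KL divergence, and cross entropy in bits:

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits (base-2 logarithm); 0 log 0 is taken as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """Cross entropy H(p, q) = H(p) + D_KL(p || q)."""
    return entropy(p) + kl_divergence(p, q)

# A uniform distribution over 8 outcomes has entropy log2(8) = 3 bits.
uniform = [1 / 8] * 8
print(entropy(uniform))                 # 3.0
# Any distribution has KL divergence 0 with itself.
print(kl_divergence(uniform, uniform))  # 0.0
```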

- Basic probability theory
  - Normal (Gaussian) distribution.
    - (Univariate) \(X \sim \mathcal{N}(\mu,\sigma^2)\): \[f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\] where \(\mu\) is the mean and \(\sigma^2\) is the variance of the distribution.
    - (Multivariate) \(\boldsymbol x \sim \mathcal{N}_k(\boldsymbol\mu,\mathbf\Sigma)\): \[f(\boldsymbol x) = (2\pi)^{-k/2} |\mathbf\Sigma|^{-1/2} e^{-\frac{1}{2} (\boldsymbol x - \boldsymbol\mu)^\mathrm{T} \mathbf\Sigma^{-1}(\boldsymbol x - \boldsymbol\mu)}\] where \(\boldsymbol\mu\) is the mean vector and \(\mathbf\Sigma\) is the covariance matrix of the distribution.
    - *Maximum entropy*: the normal distribution is the probability distribution that maximizes the entropy when the mean \(\mu\) and the variance \(\sigma^2\) are fixed.
    - The normal distribution does not have any shape parameter. Moreover, its skewness and excess kurtosis are always 0.
- Basic descriptive statistics
  - *Descriptive statistics* describe the properties of data sets quantitatively, without making any further inference.
  - *Population* vs. *sample*.
  - Three major kinds of descriptive statistics:
    - *Central tendency*: sample means (**arithmetic mean**, geometric, harmonic), **median**, **mode**, mid-range.
      - The arithmetic mean is an unbiased estimator of the population mean (expectation).
      - The median and the mode are the most robust in the presence of outliers.
    - *Dispersion (or variability)*: minimum, maximum, range, IQR (interquartile range), maximum absolute deviation, MAD (mean absolute deviation), **sample variance** \(s^2\) with Bessel’s correction, **CV (coefficient of variation)**, **VMR (index of dispersion)**.
      - IQR and MAD are robust in the presence of outliers.
      - The sample variance (with Bessel’s correction) is an unbiased estimator of the population variance.
      - CV and VMR are the sample standard deviation and the sample variance normalized by the mean, respectively, thus they are sometimes called *relative standard deviation* and *relative variance*; they are *not* unbiased though.
    - *Shape*: sample skewness, sample excess kurtosis.
      - These statistics show how a sample deviates from normality, since the skewness and the excess kurtosis of a normal distribution are 0. The estimators may vary under different circumstances.
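These statistics can be computed with Python’s standard `statistics` module; a quick sketch (CV and VMR are derived by hand, since the module does not provide them directly):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)    # arithmetic mean
median = statistics.median(data)
mode = statistics.mode(data)    # most frequent value
s2 = statistics.variance(data)  # sample variance with Bessel's correction (divide by n - 1)
s = statistics.stdev(data)      # sample standard deviation

cv = s / mean    # coefficient of variation (relative standard deviation)
vmr = s2 / mean  # variance-to-mean ratio (index of dispersion)

print(mean, median, mode, s2, cv, vmr)
```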

That is, the discrete uniform distribution maximizes the entropy for a random string. Since \(|\mathcal{X}| = 2^\ell\), we have \(p(x) = 2^{-\ell}\) and \(\operatorname{H}(X) = -\sum_{x\in\mathcal{X}} 2^{-\ell} \log_2 2^{-\ell} = \ell\) (bits). We conclude that the information that can be represented in an \(\ell\)-bit string is at most \(\ell\) bits. Some practical results include:

- In general, pseudorandom data (assume no prior knowledge) cannot be losslessly compressed, e.g., the uniform key used in a one-time pad must have \(\log_2 |\mathcal{M}|\) bits (lower bound) so as not to compromise the perfect secrecy. (Further topic: *Shannon’s source coding theorem*)
- Fully correct encoding/decoding of data, e.g., the \(\mathsf{Enc}(m)\) and \(\mathsf{Dec}(c)\) algorithms in a private-key encryption scheme, must ensure that the probability distributions of \(m \in \mathcal{M}\) and \(c \in \mathcal{C}\) have the same entropy.
- An algorithm with finite input cannot generate randomness infinitely. Consider a circuit that takes the encoded algorithm with some input (\(\ell\) bits in total) and outputs some randomness; the entropy of the output data is at most \(\ell\) bits. (Further topic: *Kolmogorov complexity*)
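The first point can be observed empirically: compressing uniform random bytes yields no savings, while low-entropy data shrinks drastically. A sketch using Python’s `os.urandom` and `zlib`:

```python
import os
import zlib

random_data = os.urandom(100_000)  # uniform random bytes: ~8 bits of entropy per byte
repetitive_data = b"A" * 100_000   # extremely low entropy

# Uniform random data is essentially incompressible (zlib even adds slight
# overhead), while repetitive data shrinks to a tiny fraction of its size.
print(len(zlib.compress(random_data)))      # slightly more than 100000
print(len(zlib.compress(repetitive_data)))  # a few hundred bytes at most
```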

*Relative entropy (KL divergence)* quantifies how much information diverges between two sets of data. For data differencing, the KL divergence gives the minimum patch size that is needed to reconstruct target data (with distribution \(p(x)\)) given source data (with distribution \(q(x)\)).

In particular, if \(p(x) = q(x)\), which means that the two distributions are identical, we have \(\operatorname{D}_\mathrm{KL}(p\|q) = 0\). This follows our intuition that no information is gained or lost during data encoding/decoding. If \(p(x_0) = 0\) at \(x=x_0\), we take \(p(x) \log \frac{p(x)}{q(x)} = 0\), to justify the fact that the target data is trivial to reconstruct at this point, no matter how much information \(q(x)\) contains. However, if \(q(x_0) = 0\) at \(x=x_0\), we should take \(p(x) \log \frac{p(x)}{q(x)} = \infty\), so that the target data is impossible to reconstruct if we have only trivial \(q(x)\) at some point (unless \(p(x_0) = 0\)).

**Lemma 4.1. (Gibbs’ inequality)**^{2} *The KL divergence is always non-negative: \(\operatorname{D}_\mathrm{KL}(p\|q) \geq 0\).*

Informally, Lemma 4.1 simply states that in order to reconstruct target data from source data, either more information (\(\operatorname{D}_\mathrm{KL}(p\|q) > 0\)) or no further information (\(\operatorname{D}_\mathrm{KL}(p\|q) = 0\)) is needed.

**Theorem 4.2.** *Normal distribution \(\mathcal{N}(\mu,\sigma^2)\) maximizes the differential entropy for given mean \(\mu\) and variance \(\sigma^2\).*

**Proof.**^{3} Let \(g(x)\) be a pdf of the normal distribution with mean \(\mu\) and variance \(\sigma^2\). Let \(f(x)\) be an arbitrary pdf with the same mean and variance.

Consider the KL divergence between \(f(x)\) and \(g(x)\). By Lemma 4.1 (Gibbs’ inequality): \[\operatorname{D}_\mathrm{KL}(f\|g) = \int_{-\infty}^\infty f(x) \log \frac{f(x)}{g(x)} dx = \operatorname{H}(f,g) - \operatorname{H}(f) \geq 0\]

Notice that \[\begin{align*} \operatorname{H}(f,g) &= - \int_{-\infty}^\infty f(x) \log g(x) dx \\ &= - \int_{-\infty}^\infty f(x) \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right) dx \\ &= \frac{1}{2} \left( \log(2\pi\sigma^2) + 1 \right) \\ &= \operatorname{H}(g) \end{align*}\]Therefore, \[\operatorname{H}(g) \geq \operatorname{H}(f)\] That is, the distribution of \(g(x)\) (Gaussian) always has the maximum entropy.

■

It is also possible to derive the normal distribution directly from the principle of maximum entropy, under the constraint such that \(\int_{-\infty}^\infty (x-\mu)^2f(x)dx = \sigma^2\).
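Theorem 4.2 can also be sanity-checked against closed-form entropies: for a fixed variance, a Laplace distribution has strictly smaller differential entropy than the Gaussian. A sketch in Python (entropies in nats, i.e., using the natural logarithm; the two closed-form expressions are standard results):

```python
import math

def gaussian_entropy(var):
    """Differential entropy of N(mu, var) in nats: (1/2) ln(2*pi*e*var)."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def laplace_entropy(var):
    """Differential entropy of a Laplace distribution with variance `var`:
    Laplace(b) has variance 2*b^2 and entropy 1 + ln(2*b)."""
    b = math.sqrt(var / 2)
    return 1 + math.log(2 * b)

# With the variance fixed at 1, the Gaussian has strictly larger entropy,
# as Theorem 4.2 predicts.
print(gaussian_entropy(1.0))  # ~1.419 nats
print(laplace_entropy(1.0))   # ~1.347 nats
assert gaussian_entropy(1.0) > laplace_entropy(1.0)
```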

The well-known central limit theorem (CLT), which states that the normalized sum of independent and identically distributed random variables \(\{X_1,\dots,X_n\}\) tends toward a normal distribution, may be alternatively expressed as the monotonicity of the entropy of the normalized sum: \[\operatorname{H}\left( \frac{\sum_{i=1}^n X_i}{\sqrt{n}} \right)\] is an increasing function of \(n\). [1]

**Books:**

T. M. Cover and J. A. Thomas. *Elements of Information Theory*, 2nd ed.

**Papers:**

[1] S. Artstein, K. Ball, F. Barthe, and A. Naor, “Solution of Shannon’s problem on the monotonicity of entropy,” *Journal of the American Mathematical Society*, vol. 17, no. 4, pp. 975–982, 2004.

- What is an algorithm?
  - An algorithm is a *computational method* for solving an *abstract problem*.
    - An algorithm takes as input a set \(I\) of *problem instances*, and outputs a solution from the set \(S\) of *problem solutions*.
    - An algorithm is represented in a well-defined formal language.
    - An algorithm must be representable within a finite amount of time and space (otherwise it cannot actually be used for solving any problem).
  - An algorithm can be simulated by any *model of computation*:
    - A *Turing machine* is a model implemented through internal *states*.
    - *λ-calculus* is a model based on pure *functions*.
    - All Turing-complete models are equivalent in their computational abilities.
  - Computability: not every abstract problem is solvable. Notably, there exists a decision problem for which some instances can neither be accepted nor rejected by any algorithm. (*Undecidable problem*)
  - Complexity:
    - The complexity class P is closed under polynomial-time reductions. Hence, proof by reduction can be a useful technique in the provable security of cryptosystems.
    - If one can prove that P = NP, then one-way functions do not exist. This would invalidate the construction of cryptographically secure pseudorandom generators (PRGs). (*Pseudorandom generator theorem*)
  - In many scenarios, we assume that an algorithm acts as a stateless computation and takes independent and identically distributed inputs. It differs from a computer program conceptually.
  - An algorithm can be either *deterministic* or *probabilistic*.
    - For probabilistic algorithms, the source of randomness may come from:
      - External (physical) input of high entropy.
      - Pseudorandomness: since everything computational is deterministic, the existence of pseudorandomness relies on the (assumed) existence of one-way functions and PRGs.

Generally, pseudorandom generators used in probabilistic algorithms yield random bits according to the uniform distribution, so it is worth mentioning:

- Basic probability theory
  - Discrete uniform distribution \(\mathcal{U}\{a,b\}\): the probability distribution where a finite number of values are equally likely to be observed (each with probability \(\frac{1}{b-a+1}\)).

Cryptographic schemes are defined as tuples of deterministic or probabilistic algorithms:

- Principles of modern cryptography
  - Formal description of a *private-key encryption scheme* \(\Pi=(\mathsf{Gen},\mathsf{Enc},\mathsf{Dec})\) with message space \(\mathcal{M}\).
    - \(\mathsf{Gen}\), \(\mathsf{Enc}\), \(\mathsf{Dec}\) are three algorithms.
    - Correctness: \(\mathsf{Dec}_k(\mathsf{Enc}_k(m)) = m\).
      - For the correctness equality to hold, \(\mathsf{Dec}\) should be deterministic.
      - Assuming that we have access to a source of randomness, \(\mathsf{Gen}\) should choose a key at random and thus is probabilistic. If \(\mathsf{Gen}\) were deterministic and always generated the same key, such an encryption scheme would be of no practical use and easy to break.
      - \(\mathsf{Enc}\) can be either deterministic (e.g., as in one-time pads) or probabilistic. Later we will see that for an encryption scheme to be CPA-secure, \(\mathsf{Enc}\) must be probabilistic.
  - **Kerckhoffs’ principle (Shannon’s maxim)** claims that a cryptosystem should be secure even if the scheme \((\mathsf{Gen},\mathsf{Enc},\mathsf{Dec})\) is known to the adversary. That is, security should rely solely on the secrecy of the private key.
  - Provable security of cryptosystems requires:
    - a formal definition of security;
    - minimal assumptions;
    - rigorous proofs of security.
  - Common attacks and notions of security:
    - **Ciphertext-only attack**.
      - A cryptosystem is said to be *perfectly secret* if it is theoretically unbreakable under ciphertext-only attack.
      - A cryptosystem is said to be *computationally secure* if it is resistant to ciphertext-only attack by any polynomial-time adversary.
    - **Known-plaintext attack (KPA)**. A cryptosystem is *KPA-secure* if it is resistant to KPA.
      - KPA-security implies ciphertext-only security.
    - **Chosen-plaintext attack (CPA)**. A cryptosystem is *CPA-secure* (or *IND-CPA*) if it is resistant to CPA.
      - IND-CPA implies KPA-security.
    - **Chosen-ciphertext attack (CCA)**. A cryptosystem is *CCA-secure* (or *IND-CCA1*) if it is resistant to CCA; furthermore, a cryptosystem is *IND-CCA2* if it is resistant to adaptive CCA (where the adversary may make further calls to the oracle, but may not submit the challenge ciphertext).
      - IND-CCA1 implies IND-CPA.
      - IND-CCA2 implies IND-CCA1; thus, IND-CCA2 is the strongest of the above definitions of security.
- Perfect secrecy
  - Two equivalent definitions (the proof of equivalence uses Bayes’ theorem):
    - \(\Pr[M=m\,|\,C=c] = \Pr[M=m]\). (Observing a ciphertext \(c\) does not leak any information about the underlying message \(m\).)
    - \(\Pr[\mathsf{Enc}_K(m)=c] = \Pr[\mathsf{Enc}_K(m')=c]\). (The adversary has no bias when distinguishing two messages if given only the ciphertext \(c\).)
  - **Perfect indistinguishability**, defined on the adversarial indistinguishability experiment \(\mathsf{PrivK}_{\mathcal{A},\Pi}^\mathsf{eav}\):
    - \(\Pr[\mathsf{PrivK}_{\mathcal{A},\Pi}^\mathsf{eav} = 1] = \frac{1}{2}\). (No adversary can win the indistinguishability game with a probability better than random guessing.)
    - Perfect indistinguishability is equivalent to the definition of perfect secrecy.
    - The adversarial indistinguishability experiment is a very useful setting for defining provable security, e.g., the definition of computational indistinguishability (for arbitrary input size \(n\)):
      - \(\Pr[\mathsf{PrivK}_{\mathcal{A},\Pi}^\mathsf{eav}(n) = 1] \leq \frac{1}{2} + \mathsf{negl}(n)\), where \(\mathsf{negl}(n)\) is a negligible function.
  - Perfect secrecy implies that \(|\mathcal{K}| \geq |\mathcal{M}|\), i.e., the key space must be at least as large as the message space. If \(|\mathcal{K}| < |\mathcal{M}|\), then the scheme cannot be perfectly secret.
  - **Shannon’s theorem**: if \(|\mathcal{K}| = |\mathcal{M}| = |\mathcal{C}|\), an encryption scheme is perfectly secret iff:
    - every \(k \in \mathcal{K}\) is chosen uniformly;
    - for every \(m \in \mathcal{M}\) and \(c \in \mathcal{C}\), there exists a unique \(k \in \mathcal{K}\) such that \(\mathsf{Enc}_k(m) = c\).

A brief, formalized overview of some classical ciphers, and their security:

- One-time pad (Vernam cipher): XOR cipher with \(|\mathcal{K}| = |\mathcal{M}|\).
  - One-time pad is perfectly secret. The proof follows simply from Bayes’ theorem. (Also verified by Shannon’s theorem. While the one-time pad was initially introduced in the 19th century and patented by G. Vernam in 1919, it was not until many years later that Claude Shannon gave a formal definition of information-theoretic security and proved that the one-time pad is a perfectly secret scheme in his groundbreaking paper. [1])
  - One-time pad is deterministic. Moreover, it is a reciprocal cipher (\(\mathsf{Enc} = \mathsf{Dec}\)).
  - One-time pad is *not* secure when the same key is applied in multiple encryptions, and it is *not* CPA-secure. In fact, an adversary can succeed in such indistinguishability experiments with probability 1.
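A one-time pad over bytes can be sketched in a few lines of Python (illustrative only; `os.urandom` stands in for the uniform key source, and the messages are my own examples):

```python
import os

def gen(length: int) -> bytes:
    """Gen: choose a uniform key with |K| = |M| (here, `length` bytes)."""
    return os.urandom(length)

def enc(key: bytes, message: bytes) -> bytes:
    """Enc: bitwise XOR of message and key."""
    return bytes(a ^ b for a, b in zip(message, key))

dec = enc  # reciprocal cipher: Dec = Enc

m = b"ATTACK AT DAWN"
k = gen(len(m))
c = enc(k, m)
assert dec(k, c) == m  # correctness: Dec_k(Enc_k(m)) = m

# Key reuse is fatal: XORing two ciphertexts cancels the key and
# leaks the XOR of the two underlying messages.
m2 = b"RETREAT AT TEN"
c2 = enc(k, m2)
assert bytes(a ^ b for a, b in zip(c, c2)) == bytes(a ^ b for a, b in zip(m, m2))
```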

- Insecure historical ciphers:
  - Shift cipher: defined with key space \(\mathcal{K}=\{0,\dots,n-1\}\). (\(n=|\Sigma|\))
    - \(|\mathcal{K}|=n\), \(|\mathcal{M}|=n^\ell\).
    - Cryptanalysis using frequency analysis.
  - Substitution cipher: defined with key space \(\mathcal{K} = \mathfrak{S}_\Sigma\) (the symmetric group on \(\Sigma\)).
    - \(|\mathcal{K}|=n!\), \(|\mathcal{M}|=n^\ell\).
    - Cryptanalysis using frequency analysis.
  - Vigenère cipher (poly-alphabetic shift cipher): like the (mono-alphabetic) shift cipher, but the key length is an (unknown) integer \(t\).
    - \(|\mathcal{K}|=n^t\), \(|\mathcal{M}|=n^\ell\). (Typically \(t \ll \ell\).)
    - Cryptanalysis using Kasiski’s method, the index-of-coincidence method, and frequency analysis.
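The shift cipher’s tiny key space (\(|\mathcal{K}| = 26\) for the English alphabet) makes exhaustive search trivial, even before any frequency analysis; a sketch:

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # Sigma, with n = |Sigma| = 26

def shift_encrypt(key: int, message: str) -> str:
    return "".join(ALPHABET[(ALPHABET.index(ch) + key) % 26] for ch in message)

def shift_decrypt(key: int, ciphertext: str) -> str:
    return shift_encrypt(-key, ciphertext)

c = shift_encrypt(3, "attackatdawn")
print(c)  # dwwdfndwgdzq

# Exhaustive search over all 26 keys; frequency analysis (or simply spotting
# the readable candidate) identifies the correct key.
candidates = {k: shift_decrypt(k, c) for k in range(26)}
assert candidates[3] == "attackatdawn"
```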

Lessons learned from these classical ciphers: While perfect secrecy is easy to achieve (one-time pads), designing practical cryptographic schemes (with shorter keys, and computationally hard to break) can be difficult.

The construction of private-key encryption schemes involves probabilistic algorithms. We simply assume that an unlimited supply of independent, unbiased random bits is available for these cryptographic algorithms. But in practice, this is a non-trivial issue, as the source of randomness must provide high-entropy data so as to accommodate cryptographically secure random bits.

In the perfectly secret scheme of one-time pads, the key generation algorithm \(\mathsf{Gen}\) requires access to a source of randomness in order to choose the uniformly random key \(k \in \mathcal{K}\). Practically, high-entropy data may be collected via physical input, or even written fully by hand with human labor.

Theoretically, without external intervention, we have:

**Conjecture 3.1.** *Pseudorandom generators exist.*

**Theorem 3.2. (Pseudorandom generator theorem)** *Pseudorandom generators exist if and only if one-way functions exist.*

Pseudorandomness is also a basic construction in CPA-secure encryption algorithms (\(\mathsf{Enc}\)), e.g., in stream ciphers and block ciphers.

So what is an acceptable level of pseudorandomness, if we are not sure whether such generators theoretically exist? Intuitively, if one cannot distinguish between a “pseudorandom” string (generated by a PRG) and a truly random string (chosen according to the uniform distribution), we have confidence that the PRG is a good one. Various statistical tests have been designed for testing the randomness of PRGs.
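As an illustration, the simplest such test is a monobit frequency test (in the spirit of the NIST statistical test suite): for uniform bits, the normalized sum of ±1-mapped bits should be small. A sketch, using Python’s `random` module as the generator under test:

```python
import math
import random

def monobit_statistic(bits):
    """Monobit frequency test: map bits to +/-1 and sum; for uniform bits,
    |S| / sqrt(n) is approximately half-normal (small for a good generator)."""
    n = len(bits)
    s = sum(1 if b else -1 for b in bits)
    return abs(s) / math.sqrt(n)

random.seed(42)  # fixed seed for reproducibility
bits = [random.getrandbits(1) for _ in range(100_000)]
print(monobit_statistic(bits))  # small, typically below 3

# A blatantly non-uniform stream fails immediately: all ones gives sqrt(n).
print(monobit_statistic([1] * 100_000))  # ~316.2
```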

It holds true that:

**Corollary 3.3.** By redefining the key space, we can assume that any encryption scheme \(\Pi=(\mathsf{Gen},\mathsf{Enc},\mathsf{Dec})\) satisfies

- \(\mathsf{Gen}\) chooses a uniform key.
- \(\mathsf{Enc}\) is deterministic.

If so, why do we still need probabilistic \(\mathsf{Enc}\) in CPA-secure encryptions? Can’t we just make \(\mathsf{Enc}\) deterministic while still being CPA-secure?

The first thing to realize is that chosen-plaintext attacks are geared towards multiple encryptions (with the same secret key \(k\)), so when the adversary obtains a pair \((m_0, c_0)\) such that \(\Pr[C=c_0\,|\,M=m_0] = 1\), *the key is already leaked*. (Recall that the adversary knows the *deterministic* algorithm \(\mathsf{Enc}_k\); thus recovering \(k\) from known \(m_0\) and \(c_0\) can be quite feasible, e.g., in a one-time pad, \(k = m_0 \oplus c_0\).) The only way to get around this is to make \(\mathsf{Enc}_k\) *probabilistic* (constructed from a *pseudorandom function*), so that an adversary cannot recover the key efficiently within polynomial time.

Note that perfect secrecy is not possible under CPA, since there is a small chance that the adversary recovers the key (by, for example, traversing an exponentially large lookup table of all random bits) and succeeds in the subsequent indistinguishability experiment with a probability higher than one half, albeit by a negligible advantage.

One-time pad is one of the most (provably) secure encryption schemes, and its secrecy does not rely on any computational hardness assumptions. However, it requires that \(|\mathcal{K}| \geq |\mathcal{M}|\) (which in fact is a necessary condition for any perfectly secret scheme), thus its real-world use is limited.

The one-time key \(k\) (uniformly chosen from the key space \(\mathcal{K}\)) may *not* simply be reused in multiple encryptions. Assume that \(|\mathcal{K}| = |\mathcal{M}|\); for encryptions of \(n\) messages, the message space is expanded to size \(|\mathcal{M}|^n\) while the key space remains \(\mathcal{K}\), thus \(|\mathcal{K}| < |\mathcal{M}|^n\). Such a degraded scheme (a many-time pad) is theoretically insecure and vulnerable to several practical cryptanalyses.

A historical exploit of this vulnerability of the many-time pad occurred in the VENONA project, where the U.S. Army’s Signal Intelligence Service (later the NSA) aimed at decrypting messages sent by the USSR intelligence agencies (the KGB) over a span of four decades. As the KGB mistakenly reused some portions of their one-time pad codebooks, the SIS was able to break a good portion of the messages.^{1}

**Books:**

J. Katz and Y. Lindell, *Introduction to Modern Cryptography*, 2nd ed.

**Papers:**

[1] C. E. Shannon, “Communication theory of secrecy systems,” *Bell system technical journal*, vol. 28, no. 4, pp. 656–715, 1949.

R. L. Benson, “The Venona story.” https://www.nsa.gov/about/cryptologic-heritage/historical-figures-publications/publications/coldwar/assets/files/venona_story.pdf↩

- Basic probability theory
- An
**experiment**has various**outcomes**. The set of all probable outcomes constitute the**sample space**of that experiment. - Any
*measurable*subset of the sample space \(\Omega\) is known as an**event**. - A
**probability measure**is a real-valued function defined on a set of events \(\mathcal{F}\) in a probability space \((\Omega,\mathcal{F},\Pr)\) that satisfies measure properties such as countable additivity. (See**Kolmogorov’s axioms**.) - The
- The **union bound** (Boole’s inequality) follows from the fact that a probability measure is σ-sub-additive.
- *Events* can be *independent*. The following conditions hold equivalently for independent events:
    - \(\Pr[A_1 \cap A_2] = \Pr[A_1] \cdot \Pr[A_2]\)
    - \(\Pr[A_1|A_2] = \Pr[A_1]\)
- **Bayes’ theorem** and the **law of total probability** describe the basic properties of conditional probability.
- A **random variable** is a mapping that maps an *outcome* to a *value*; since the preimage of each value is an *event*, we have a probability measure defined on random variables, such as \(\Pr[X=x]\).
    - For *discrete* random variables, a **probability mass function (pmf)** determines a **discrete probability distribution**.
    - For *continuous* random variables, a **probability density function (pdf)** determines a **continuous probability distribution**.
- *Random variables* can be *uncorrelated* (\(\operatorname{Cov}(X,Y)=0 \iff \operatorname{E}[XY] = \operatorname{E}[X] \cdot \operatorname{E}[Y]\)).
    - *Independent* random variables are uncorrelated.
    - However, uncorrelated random variables are not necessarily independent.
- A *distribution* can be presented using **moments**:
    - **Expectation (mean)** \(\operatorname{E}[X]\): first raw moment.
    - **Variance** \(\operatorname{Var}(X)\): second central moment.
    - **Skewness** \(\operatorname{Skew}(X)\): third standardized moment.
    - **Kurtosis** \(\operatorname{Kurt}(X)\): fourth standardized moment.
    - For a bounded probability distribution, the collection of all the moments (of all orders) uniquely determines the distribution.
    - Some distributions, notably Cauchy distributions, do not have all of their moments defined.
- **Concentration inequalities** provide bounds on how a random variable deviates from some value (usually one of its *moments*).
    - **Markov’s inequality** is the simplest and weakest probability bound.
    - **Chebyshev’s inequality** provides an upper bound on the probability that a random variable deviates from its expectation.
    - The **Chernoff bound** is stronger than Markov’s inequality.
    - **Hoeffding’s inequality** provides an upper bound on the probability that the sum of random variables deviates from its expectation. It is also useful for analyzing the number of samples required to obtain a confidence interval.
- Some common *discrete probability distributions*:
    - **Bernoulli distribution**: special case of the binomial distribution, \(\text{B}(1,p)\).
    - **Binomial distribution** \(\text{B}(n,p)\): given the number of draws \(n\), the distribution of the number of successes.
    - **Geometric distribution** \(\text{Geom}(p)\): special case of the negative binomial distribution, \(\text{NB}(1,1-p)\).
    - **Negative binomial distribution** \(\text{NB}(r,p)\): given the number of failures \(r\), the distribution of the number of successes.
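The concentration bounds above can be checked numerically. A quick sketch for Chebyshev’s inequality (the binomial parameters here are my own arbitrary choice):

```python
import random

random.seed(42)

# Empirically check Chebyshev's inequality Pr[|X - mu| >= k*sigma] <= 1/k^2
# for X ~ Binomial(n=100, p=0.5): mu = 50, sigma = 5.
n, p, trials = 100, 0.5, 20000
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5

samples = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]

for k in (2, 3):
    freq = sum(abs(x - mu) >= k * sigma for x in samples) / trials
    bound = 1 / k ** 2
    print(f"k={k}: empirical {freq:.4f} <= Chebyshev bound {bound:.4f}")
    assert freq <= bound
```

The empirical tail frequency is far below the Chebyshev bound, which is expected: Chebyshev holds for *any* distribution with finite variance, so it is loose for a well-behaved binomial.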

Intuitively, probability is a measure of uncertainty. Mathematically, probability is a real-valued function defined on a set of events in a probability space that satisfies measure properties such as countable additivity (or simply, *measure* on probability space).

Typically, a probability density function (pdf) or a probability mass function (pmf) determines a distribution in the probability space.

**Example 2.1.** Consider the wave function of a particle: \[\Psi(x,t)\] where \(x\) is position and \(t\) is time.

If the particle’s position is measured, its location cannot be determined but is described by a probability distribution: The probability that the particle is found in \([x, x+\Delta x]\) is \[\Delta\Pr = |\Psi(x,t)|^2 \Delta x\]

The square modulus of the wave function (which is real-valued, non-negative) \[\left|\Psi(x, t)\right|^2 = {\Psi(x, t)}^{*}\Psi(x, t) = \rho(x, t)\] is interpreted as the pdf.

Since the particle must be found somewhere, we have the normalization condition: (by the assumption of unit measure) \[\int\limits_{-\infty}^\infty |\Psi(x,t)|^2 dx = 1\]
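As a numerical sanity check, assuming a stationary Gaussian wave packet (so that \(|\Psi|^2\) is a normal pdf), the normalization condition can be verified by direct integration:

```python
import math

# Numerically verify the normalization condition for a Gaussian wave packet,
# whose squared modulus |Psi|^2 is a normal pdf with standard deviation sigma.
sigma = 1.0

def prob_density(x):
    # |Psi(x)|^2 for a Gaussian packet centered at 0 (illustrative choice)
    return math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# A Riemann sum over [-10*sigma, 10*sigma] approximates the integral over R;
# the truncated tail mass beyond 10 sigma is negligibly small.
dx = 1e-3
total = sum(prob_density(-10 + i * dx) * dx for i in range(int(20 / dx)))
print(f"integral of |Psi|^2 dx = {total:.6f}")
assert abs(total - 1.0) < 1e-6
```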

In analysis, distributions are also called generalized functions: the notion expands functions to objects whose derivatives may not exist in the classical sense. Thus, it is not uncommon that many probability distributions cannot be described using classical (differentiable) functions. The Dirac delta function \(\delta\) (which is a generalized function) is often used to represent a discrete distribution, or a partially discrete, partially continuous distribution, using a pdf.

Bayes’ theorem forms the basis for *Bayesian inference*, which is an important method of statistical inference that updates the probability for a *hypothesis* as more evidence or information becomes available.

Hypotheses can also be fallacies. In Bayesian inference, if one can make the assumption that every event occurs independently and the probability is identically distributed throughout lasting trials, it is clear to see that some common beliefs are mistaken.

**Gambler’s fallacy (Monte Carlo fallacy).** If an outcome occurs more frequently than normal during some period, it will happen less frequently in the future; contrariwise, if an outcome happens less frequently than normal during some period, it will happen more frequently in the future. This is presumed to be a means of *balancing* nature.

The gambler’s fallacy is indeed a fallacy if the outcomes are known to be independently and identically distributed. Assuming that the future (the probability of event \(A_2\)) has no effect on the past (the probability of event \(A_1\)), we have \(\Pr[A_1|A_2] = \Pr[A_1]\). From Bayes’ theorem, it holds that \[\Pr[A_2|A_1] = \Pr[A_2]\] That is, past events should not increase or decrease our confidence in a future event.
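The Bayesian argument can also be checked by simulation: for i.i.d. fair coin flips, conditioning on a streak of heads does not change the empirical probability of the next head.

```python
import random

random.seed(7)

# For i.i.d. fair coin flips, Pr[heads | previous three heads] should
# equal the unconditional Pr[heads] = 0.5 -- past streaks carry no information.
flips = [random.random() < 0.5 for _ in range(200_000)]

after_streak = [flips[i] for i in range(3, len(flips))
                if flips[i - 3] and flips[i - 2] and flips[i - 1]]

cond = sum(after_streak) / len(after_streak)
print(f"Pr[heads | three heads in a row] ~ {cond:.3f}")
assert abs(cond - 0.5) < 0.02
```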

**Hot-hand fallacy.** A person who has experienced success with a seemingly random event has a greater chance of further success in additional attempts. That is, if an outcome occurs more frequently than normal during some period, it will also happen frequently in the future.

If psychological factors can be excluded, then the hot-hand fallacy is a fallacy caused by people’s confirmation bias. As with the gambler’s fallacy, if we can’t assume that the outcomes are independently and identically distributed, we can’t simply conclude that this belief is mistaken.

**Inverse gambler’s fallacy.** If an unlikely outcome occurs, then the trials must have been repeated many times before.

Assuming that the past (the probability of event \(A_1\)) has no effect on the future (the probability of event \(A_2\)), we have \(\Pr[A_2|A_1] = \Pr[A_2]\). From Bayes’ theorem, it holds that \[\Pr[A_1|A_2] = \Pr[A_1]\] That is, our confidence in \(A_1\) should remain unchanged after we observe \(A_2\).

**Fallacies of hasty generalization and slothful induction (law of small numbers).** Informal fallacies reaching an inductive generalization based on insufficient evidence, or denying a reasonable conclusion of an inductive argument.

Statistically speaking, sampling from a small group can lead to misbeliefs that fail to hold for the entire population, if hypothesis testing is not carefully conducted.

**Theorem 2.2. (Law of large numbers)** Let \(X_1, X_2, \dots\) be an infinite sequence of i.i.d. Lebesgue integrable random variables with fixed expectation \(\operatorname{E}[X_1] = \operatorname{E}[X_2] = \cdots = \mu\). Define the sample average \[\overline{X}_n = \frac{1}{n}(X_1 + \dots + X_n)\]

**(Weak law of large numbers; Khintchine’s law)** The sample average converges in probability towards the expectation: \[\lim_{n\to\infty} \Pr[|\overline{X}_n - \mu| > \varepsilon] = 0\]

**(Strong law of large numbers)** The sample average converges *almost surely* to the expectation: \[\Pr[\lim_{n\to\infty} \overline{X}_n = \mu] = 1\]

Chebyshev’s inequality provides an upper bound on the probability that a random variable deviates from its expected value. Thus, it may be used as a proof for the weak law of large numbers.
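A quick simulation illustrating the weak law (the Uniform(0,1) population, with \(\mu = 0.5\), is an arbitrary choice):

```python
import random

random.seed(1)

# Weak law of large numbers: sample averages of i.i.d. Uniform(0,1)
# draws (mu = 0.5) concentrate around mu as n grows.
def sample_average(n):
    return sum(random.random() for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    avg = sample_average(n)
    print(f"n={n:>6}: sample average = {avg:.4f}")

assert abs(sample_average(1_000_000) - 0.5) < 0.01
```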

The expected value of a discrete random variable \(X\): \[\operatorname{E}[X] = \sum_{x \in \mathcal{X}} x \Pr[X=x]\] While it seemingly estimates how people would “expect” a random variable to take its value, it can sometimes lead to counterintuitive results, as shown by the following paradox.

**St. Petersburg Paradox.**^{1} A casino offers a game of chance in which a gambler flips a fair coin until it comes up tails. The initial stake starts at \(2\) dollars and is doubled every time heads appears. The first time tails appears, the game ends and the gambler wins whatever is in the pot. Thus if the coin comes up tails the first time, the gambler wins \(2^1=2\) dollars, and the game ends. If the coin comes up heads, the coin is flipped again. If the coin comes up tails the second time, the gambler wins \(2^2=4\) dollars, and the game ends. If the coin comes up heads again, the coin is flipped again. If the coin comes up tails the third time, the gambler wins \(2^3=8\) dollars, and the game ends. And so on. Eventually the gambler wins \(2^k\) dollars, where \(k\) is the number of coin flips until tails appears. (It is easy to see that \(k\) follows the geometric distribution.) What would be a fair price to pay the casino for entering such a game? (Assume that there is no house edge.)

| \(k\)th coin flip | \(\Pr[\text{Tails}]\) | Stake ($) | Expected payoff ($) |
|---|---|---|---|
| 1 | \(\frac{1}{2}\) | 2 | 1 |
| 2 | \(\frac{1}{4}\) | 4 | 1 |
| 3 | \(\frac{1}{8}\) | 8 | 1 |
| 4 | \(\frac{1}{16}\) | 16 | 1 |
| 5 | \(\frac{1}{32}\) | 32 | 1 |
| … | … | … | … |
| \(k\) | \((1/2)^k\) | \(2^k\) | 1 |

The price should be made equal to the expected value that a gambler wins the stake, which is \[\operatorname{E}[\text{Payoff}] = \sum_{k=1}^{+\infty} \left(\frac{1}{2}\right)^k \cdot 2^k = \sum_{k=1}^{+\infty} 1 = +\infty\]

If a rational gambler pays to enter a game if and only if its average payoff is larger than its price, then he would pay any price to enter this game (since the expected payoff of this game is infinitely large). But in reality, few of us are willing to pay even tens of dollars to enter such a game. What went wrong? Furthermore, if *mathematical* expectation does not correctly reflect what people expect from a game, how do we quantify the “*true*” expectation?
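A Monte Carlo sketch makes the paradox tangible: the sample mean grows without bound as more games are played, yet the typical (median) payoff stays tiny.

```python
import random

random.seed(2023)

def play_once():
    # Flip a fair coin until tails; the payoff doubles with every heads.
    payoff = 2
    while random.random() < 0.5:  # heads: keep flipping
        payoff *= 2
    return payoff

payoffs = sorted(play_once() for _ in range(100_000))
median = payoffs[len(payoffs) // 2]
mean = sum(payoffs) / len(payoffs)

# The theoretical expectation is infinite, yet a typical game pays little.
print(f"median payoff = {median}, sample mean = {mean:.1f}")
assert median <= 4  # half of all games pay the minimum of 2 dollars
```

The sample mean here is dominated by a handful of astronomically lucky runs, which is exactly why the mathematical expectation fails to capture what a typical player experiences.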

The St. Petersburg paradox was initially stated by Nicolas Bernoulli in 1713. There are several proposed approaches to resolving the paradox, including expected utility theory with the hypothesis of diminishing marginal utility [1], and cumulative prospect theory. However, none of them is purely probability-theoretic, as they require hypothesized economic/behavioral models.

In probability theory, the variance of a random variable \(X\) is defined as \[\operatorname{Var}(X) = \operatorname{E}[(X-\mu)^2]\] Given \(N\) observations, the uncorrected sample variance is \[s_n^2 = \frac{1}{N} \sum_{i=1}^N (X_i-\bar{X})^2\]

In statistics, when calculating the sample variance in order to give an estimation of the population variance, and the population mean is unknown, Bessel’s correction^{2} (use of \(N-1\) instead of \(N\)) is often preferred: \[s^2 = \frac{1}{N-1} \sum_{i=1}^N (X_i-\bar{X})^2\]

A few remarks and caveats:

- Bessel’s correction is only necessary when the population mean is unknown and estimated as the sample mean.
- Without Bessel’s correction, the estimated variance is *biased*: the biased sample variance \(s_n^2\) tends to be smaller than the population variance \(\sigma^2\) (by a factor of \(\frac{N-1}{N}\) in expectation), whether the sample mean is smaller or larger than the population mean.
- Bessel’s correction yields an unbiased estimator of the variance and covariance, but not of the standard deviation.
- The corrected estimator often has a larger mean squared error (MSE).
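A small simulation illustrating these remarks (the sample size \(N=5\) and the uniform population are arbitrary choices):

```python
import random
import statistics

random.seed(5)

# Compare the biased (1/N) and Bessel-corrected (1/(N-1)) sample variance
# as estimators of the population variance of Uniform(0,1), sigma^2 = 1/12.
true_var = 1 / 12
N, runs = 5, 20_000

biased, corrected = [], []
for _ in range(runs):
    sample = [random.random() for _ in range(N)]
    biased.append(statistics.pvariance(sample))    # divides by N
    corrected.append(statistics.variance(sample))  # divides by N - 1

mean_biased = sum(biased) / runs
mean_corrected = sum(corrected) / runs
print(f"true {true_var:.4f}, biased {mean_biased:.4f}, corrected {mean_corrected:.4f}")

# The uncorrected estimator underestimates by a factor of (N-1)/N on average.
assert mean_biased < mean_corrected
assert abs(mean_corrected - true_var) < abs(mean_biased - true_var)
```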

**Books:**

M. Mitzenmacher and E. Upfal, *Probability and Computing: Randomized Algorithms and Probabilistic Analysis*.

M. Baron, *Probability and Statistics for Computer Scientists*, 2nd ed.

**Articles:**

[1] R. Martin, “The St. Petersburg Paradox,” in *The Stanford Encyclopedia of Philosophy*, Summer 2014 ed., E. N. Zalta, Ed. Metaphysics Research Lab, Stanford University, 2014. https://plato.stanford.edu/archives/sum2014/entries/paradox-stpetersburg/

Yes, it bugs me when I can’t restore my last open tabs, and I want my old session back so badly. Remembering last open tabs, if I get the history right, was a feature first introduced by Google Chrome, and it soon started to play an indispensable part in my daily workflow. I’m a multitasker, but the computing resources of my laptop are very limited. Say I have a session in which I am working on a homework report, with loads of folders, web pages and editor buffers open that can easily fill up gigabytes of RAM; then I realize that I need to compile something really hard-core, or maybe just take a short rest and do some random surfing on the web. Certainly I would rather close all those engrossing processes for the time being, hoping that they could later continue with all the open tabs I left off.

It’s mainly four types of applications that account for so-called “work sessions” for me:

- Terminal emulator
- File manager
- Web browser
- Text editor

Terminals don’t take up a lot of memory, so I wouldn’t mind leaving them open. Typical browsers, including Chromium and Firefox, do support session resuming (and there are even extensions that allow you to save current tabs and recover them at any later time). Any decent text editor (or IDE) can also be configured to remember open tabs / sessions. Average file managers, however, fail to meet even this basic need of productivity.

I’m on GNOME 3, but currently using the Caja file manager – ever since Nautilus 3.6 decided to remove two or three features I found important to me (compact mode, backspace navigation) and introduced an awkward, smug “search-whatever-shit-as-you-type” feature.

File managers I’ve tried so far:

- Nautilus (GNOME). As said, already rendered unusable for me.
- Pantheon. Like Nautilus, it doesn’t feature a compact mode either.
- Nemo (Cinnamon). Nope, segfaults too often.
- Caja (MATE). It’s OK, just what I’m using right now.
- Open issue for saving sessions: https://github.com/mate-desktop/caja/issues/523

- Dolphin (KDE). OK, unless it’s from the foreign land of KDE.
- Open issue for saving sessions: https://bugs.kde.org/show_bug.cgi?id=246028

- Konqueror (KDE). It’s both a web browser and a file manager, and it’s the only one I know that can save / restore open tabs. Unfortunately, it has only limited file management functionality. (Sufficient as a *file viewer*, though?)

Among all of the above, I settled on Caja, simply because there was no reasonably good alternative. Still, I’m wishing for something that can save session states for me. After doing a little research, I realized that:

- There is no commonly adopted protocol addressing this issue. Not even on GNOME.
- There is EggSMClient, but state saving is implemented at the X (desktop) session level and is thus only available on the XSMP backend. It works when you log out of your desktop session and log back in, but not when you close the window and restart the application.
- It is ultimately the application itself which must maintain its session states and restore them when required.

Let’s go into more technical detail. In Caja (or other similarly GTK+/GLib-based file managers), one needs to implement the following:

- On the `destroy` callback of the main `GtkObject`, all last remaining session data (e.g., internal states about open tabs and windows) must be saved to disk. (After the main loop ends, there’s no more chance to get this information back.)
- On GUI initialization, read the last session data (if it exists) from disk, and reopen the saved tabs and windows.
- On every state-changing event (e.g., creating or closing a tab/window, repositioning tabs), session data is updated accordingly and, optionally, saved to disk.
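The same save/restore cycle, toolkit details aside, can be sketched generically. The following Python sketch is purely hypothetical (the real Caja code is C/GTK+ and persists XML, not JSON):

```python
import json
import os
import tempfile

# Hypothetical sketch of the session-saving cycle described above
# (the real Caja implementation is C/GTK+ and stores XML, not JSON).
SESSION_FILE = os.path.join(tempfile.gettempdir(), "last-session.json")

def save_session(open_tabs):
    # Called from the main window's "destroy" handler: persist the
    # remaining session state before the main loop ends.
    with open(SESSION_FILE, "w") as f:
        json.dump({"tabs": open_tabs}, f)

def restore_session():
    # Called on GUI initialization: reopen saved tabs, if any.
    if not os.path.exists(SESSION_FILE):
        return []
    with open(SESSION_FILE) as f:
        return json.load(f)["tabs"]

tabs = ["/home/user/Documents", "/home/user/Pictures"]
save_session(tabs)                # e.g., on window destroy
assert restore_session() == tabs  # e.g., on next startup
```

Note that a list (rather than a set) preserves the spatial ordering of tabs, which is precisely the detail Caja’s own session data gets wrong.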

With `caja_application_get_session_data()`, making a quick workaround that enables Caja to save and restore a session is somewhat trivial labor; however, it seems Caja doesn’t record the correct (spatial) ordering of tabs in its session data – so I wouldn’t consider this a satisfying solution to the issue, and I have no intention of sending such an incomplete patch upstream. Nevertheless, it’s better than nothing, and, if the ordering of tabs really matters, it would be feasible to write a wrapper script that manipulates the XML file in `$HOME/.config/caja/last-session`.

And here goes the patch (applies to Caja 1.16.1; provided with absolutely no warranty):

Pandoc converts a Markdown source file (with metadata fields like `title`, `author` and `date`) into a properly typeset PDF document:

`$ pandoc src.md -o out.pdf`

However, Markdown is not TeX. *Not even close.* Once you need to have some bleeding edge control over the typesetting outcome, or perhaps just a little refinement on its LaTeX templating, you’ll soon notice that Pandoc has its quirks and gotchas. I’ve been utilizing Pandoc in all my serious academic writing (incl. homework reports) for years, ever since I gave up on learning more about the overwhelmingly sophisticated TeX ecosystem and turned to something that “just works”. Pandoc fits my needs well. And when it doesn’t, there’s almost always a workaround that achieves the same thing neatly. And this is what this write-up is mostly about.

**Modifying `default.latex`? Bad idea.**

You could, of course, modify the default template (`default.latex`) provided by Pandoc, as long as you’re no stranger to LaTeX. In this way, you can achieve anything you want – in *pure* LaTeX.

`$ pandoc --template my-default.latex src.md -o out.pdf`

There are, however, a few problems with this naïve approach:

- If you are tweaking the template just for something you’re currently working on, you will end up with a highly document-specific, hardly reusable template. This also defeats the purpose of using Pandoc – you could just write plain LaTeX anyway.
- If Pandoc improves its default template for a newer version, your home-brewed template won’t benefit from this (unless you’re willing to merge the diffs and resolve any conflicts by hand).

I’m conservative about changing the templates. If it’s a general issue that needs to be fixed in the default template, sending a pull request to pandoc-templates might be a better idea. Of course, if there’s a certain submission format you have to stick with (e.g., given LaTeX templates for conference papers), then you’ll have to roll your own.

I wouldn’t claim that I know the best practice of using Pandoc, but there’s such a common idiom that cannot be overstressed: *Separate presentation and content!*

In the YAML front matter of `src.md` (the main Markdown file you’re writing), put only things that matter to your potential readers:

```
---
title: Boilerplating Pandoc for Academic Writing
subtitle: or How I Learned to Stop Typesetting and Concentrate on the Math
author: Mort Yao
date: 17 November 2016
abstract: |
  Lorem ipsum dolor sit amet, consectetur adipiscing elit,
  sed do eiusmod tempor incididunt ut labore et dolore magna
  aliqua. Ut enim ad minim veniam, quis nostrud exercitation
  ullamco laboris nisi ut aliquip ex ea commodo consequat.
---
```

And in a separate YAML file (let’s call it `default.yaml`), here goes the formatting stuff:

```
---
geometry: margin=1.5in
indent: true
header-includes: |
  \usepackage{tcolorbox}
  \newcommand\qed{\hfill\rule{1em}{1em}}
---
```

Above is my personal default, and it’s worth a few words to explain:

- `geometry` is where you control the geometric settings of your document. For example, you may narrow down the page margin with `margin=1.5in`, which is equivalent to the raw LaTeX: `\usepackage[margin=1.5in]{geometry}`
- Set `indent` to any value other than `false` if paragraph indentation is desired. (And it is often desired in formal publications.)
- `header-includes` is where you define your own macros, configure existing ones, or claim `\usepackage` in case you want to use a package not enabled by Pandoc (e.g., `tcolorbox`). Although you might as well define those in other places (e.g., in the content of a Markdown file), *don’t do that*.
- This decent Q.E.D. tombstone: `\newcommand\qed{\hfill\rule{1em}{1em}}` is my favorite of all time. It doesn’t require the `amsthm` package.

With a separate `default.yaml`, now here we are:

`$ pandoc default.yaml src.md -o out.pdf`

**`header-includes`**

You might have already noticed that the `subtitle` field won’t display in the produced PDF file. As far as I’m concerned (in Pandoc 1.18), this is the expected behavior. See here in the README:

> To make `subtitle` work with other LaTeX document classes, you can add the following to `header-includes`:
>
> ```
> \providecommand{\subtitle}[1]{%
>   \usepackage{titling}
>   \posttitle{%
>     \par\large#1\end{center}}
> }
> ```

Unfortunately, this won’t work (until Issue #2139 is resolved), since Pandoc parses the `header-includes` metadata field as Markdown, and the bracketed `[1]` is misinterpreted as literal text rather than as part of a LaTeX control sequence. So the workaround is: instead of embedding `header-includes` as a metadata field in YAML, separate it into another file for this dedicated purpose (it’s simply raw LaTeX anyway), and include it using `--include-in-header/-H`:

`$ pandoc -H header.tex default.yaml src.md -o out.pdf`

Note that you can’t have two `header-includes` for one document, so the `header-includes` field specified in the YAML metadata will be overridden by the content of `header.tex`.

While the Markdown syntax for citing is rather easy (`[@id]`), it takes effort to make things right, especially if you have a certain preferred citation format (APA, MLA, Chicago, IEEE, etc.).

The suggestion is: Use pandoc-citeproc. Once you have a list of references you’re interested in, you need two things to typeset those nicely in your document:

- A CSL (Citation Style Language) file (`.csl`), to specify the citation format you want to use.
    - You can preview (and download) many common citation styles in the Zotero Style Repository.
- A BibTeX file (`.bib`), which is a list of all entries you might cite.
    - Citation entries in BibTeX format may be found easily on the Internet, through academic search engines and databases. Concatenate them one by one.

As part of the YAML metadata (assuming you have `ieee.csl` and `references.bib`):

```
csl: ieee.csl
bibliography: references.bib
```

Using `pandoc-citeproc` as a filter, generate the document with citations:

`$ pandoc --filter pandoc-citeproc -H header.tex default.yaml src.md -o out.pdf`

The list of references is appended to the end of the document. It is often desirable to give the references an obvious title (“References”), start them on a new page, and avoid any further indentation, so the following goes at the end of the Markdown source:

```
\newpage
\setlength\parindent{0pt}
# References
```

Basically, we need 5 files in total:

- For content:
    - `src.md` (Markdown + possibly LaTeX mixed format): main text.
    - `references.bib` (BibTeX/BibLaTeX format): list of references.
- For presentation:
    - `default.yaml` (YAML format): format-related metadata.
    - `header.tex` (LaTeX format): content of `header-includes`; package imports and macro definitions.
    - `ieee.csl` (CSL XML format): citation style.

And one command:

`$ pandoc --filter pandoc-citeproc -H header.tex default.yaml src.md -o out.pdf`

**`amsthm`?**

Pandoc doesn’t provide native support for `amsthm` (and I wonder if it ever will). You can still have the same thing in Pandoc Markdown:

```
\newtheorem{definition}{Definition}
\begin{definition}
Man is a rational animal.
\end{definition}
```

However, everything in between `\begin` and `\end` will be treated as raw LaTeX, and the expressiveness of Markdown is lost there. More importantly, this is purely a LaTeX-specific thing, so there’s no way for Pandoc to convert it to HTML or any other format (unless you have a filter that does the trick). Consequently, I tend to write all definitions / theorems (lemmas, claims, corollaries, propositions…) in simple Markdown:

`**Definition 1.** *Man is a rational animal.*`

It does have some advantages over `amsthm`:

- Using `amsthm`, you cannot see the numbering of each theorem (definition, etc.) in the text editor (well, not without a dedicated plugin at least). This is inconvenient when you need to refer to a prior one later. By numbering them explicitly, you can clearly see these ordinals in the Markdown source.
- It is perfectly valid Markdown, so it converts to any format as you wish (HTML, for example).

This also has some drawbacks compared to using `amsthm`, though:

- It doesn’t have theorem counters. You need to number things explicitly, manually. (Clearly you can’t have implicit numbering and explicit numbering at the same time, so here’s the trade-off.)
- It doesn’t have automatic formatting. That is, you could possibly get the style for a certain entry (plain, definition, remark) wrong.
- Semantically, they are not recognized as theorems, just normal text paragraphs. This is problematic if you want to prevent definitions and theorems from being indented, since there’s no way for LaTeX to tell them apart from normal text.

(Probably) the best solution is to write a filter that (conventionally) converts any plain text like `Definition 1` (and `Lemma 2`, `Theorem 3`, etc.) at the beginning of a paragraph to proper Markdown (for HTML targets) or the corresponding `amsthm` block (for LaTeX targets). Even better, it should be able to do cross-references accordingly (remember `Lemma 10.42`? Let’s put an anchored link on that!). This is yet to be done, but would be very helpful to someone who writes a lot of theorems and proofs and wants to avoid the kludge of mixing raw LaTeX with semantically evident Markdown.

This is an expository reading summary of a selected CRYPTO 2015 paper that I did as an assignment in KU’s Introduction to Modern Cryptography course. Adversarial-resilient Bloom filters are the counterparts of cryptographically secure hash functions in an adversarial setting, where adaptive adversaries with access to a deterministic or non-deterministic query oracle may challenge the data structure in a way that intentionally increases its false positive rate. As a preliminary result, this paper shows that the resistance of Bloom filters against computationally bounded adversaries requires that one-way functions exist; furthermore, such constructions are possible using pseudorandom permutations. I do find the proof a bit involved, but the notions of security introduced for Bloom filters are new and appealing (I haven’t encountered them anywhere else).

There is also a PDF version of this report.

Original paper:

**M. Naor and E. Yogev, “Bloom filters in adversarial environments,” in Annual Cryptology Conference, 2015.**[arXiv:1412.8356]

**Abstract**

A Bloom filter is a hash-based probabilistic data structure which is space-efficient for set membership querying, with a small probability of false positives. Naor and Yogev’s 2015 paper introduces the adversarial model and formally proposes a strong notion of security for Bloom filters, i.e., *adversarial resilience*, based on an adversarial game in a cryptographic setting. The paper also discusses the correspondence between adversarial-resilient Bloom filters and the open assumption that one-way functions exist, thus enabling theoretical constructions using pseudorandom permutations. We believe that such an understanding will help design practical Bloom filters that are safe from known attacks in software systems.

Probabilistic data structures are data structures that employ randomness in their design to enable more efficient approaches to storing and querying data, compared to deterministic ones. Traditionally, the algorithmic probabilistic analysis of such data structures assumes a model where all inputs and queries are *independent* of the internal randomness of the data structure. In this work, we consider an adversarial environment, where a computationally bounded adversary^{1} may adaptively choose inputs and queries with the intention of degrading the efficiency of the underlying data structure of some computer system. By introducing the adversarial model, we analyze the behavior of such data representations under the cryptographic notion of computational security against adversaries; furthermore, it enables us to construct more efficient, provably secure schemes for probabilistic data structures.

As a concrete example, a Bloom filter is a probabilistic data structure that holds a set \(S\) of elements approximately, using significantly fewer bits of storage and allowing for faster access than a complete representation. As a trade-off between efficiency and preciseness, for any query of \(x \in S\), a Bloom filter always outputs a *yes*-answer, and for any query of \(x \not\in S\), it should output a *yes*-answer only with small probability. In other words, a *no*-answer given by a Bloom filter indicates unambiguously that \(x \not\in S\), while a *yes*-answer indicates that \(x \in S\) probably holds^{2}, that is, it allows false positives. Ideally, the error probability that a Bloom filter returns a false positive should be as small as possible.
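To make the completeness/soundness trade-off concrete, here is a minimal textbook Bloom filter; the parameters \(m=1024\), \(k=7\) and the SHA-256-derived hash functions are illustrative choices, not taken from the paper:

```python
import hashlib

class BloomFilter:
    # A textbook Bloom filter: an m-bit array and k hash functions derived
    # from SHA-256 with distinct prefixes (an illustrative choice).
    def __init__(self, m=1024, k=7):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item):
        # A no-answer is definitive; a yes-answer may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
members = [f"member-{i}" for i in range(100)]
for x in members:
    bf.add(x)

assert all(bf.query(x) for x in members)  # completeness: no false negatives
fp = sum(bf.query(f"other-{i}") for i in range(1000)) / 1000
print(f"false positive rate ~ {fp:.3f}")  # soundness: small error rate
assert fp < 0.05
```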

Approaching the commonly-seen set membership problem, Bloom filters have been implemented widely in real-world applications, specifically as internal data representations for optimizing large-scale software systems. For example, Akamai’s CDN^{3} servers maintain Bloom filters in their memories to decide whether to look up the disk cache for a requested resource, and a false positive of the Bloom filter causes a cache miss, which means that the server has to make an unnecessary disk lookup at an expense of time and system workload; if an attacker exploits the behaviors of the Bloom filter, it is possible for them to cast queries that degrade the disk cache hit rate of the CDN servers, and consequently, perform a Denial-of-Service (DoS) attack.[1] In another scenario, where Bitcoin clients apply Bloom filters in the Simplified Payment Verification (SPV) mode to increase the overall performance of wallet synchronization, an adversary may perform a DoS attack on an SPV node by learning from the responses of Bloom filters they have access to.[2]

As discussed above, the adversarial model addresses some security issues, and thus the necessity arises of defining security in adversarial environments and constructing provably secure Bloom filters. Essentially, it is desirable for a well-constructed Bloom filter to maintain its small error probability in an adversarial environment; we say that such a Bloom filter is *adversarial resilient* (or just *resilient*). Consider an adversarial game in which an adversary has oracle access to the Bloom filter and is allowed to make at most \(t\) queries before outputting a certain \(x^*\) (that has not been queried before) which it believes to be a false positive; if it is, the adversary wins the game. We say that a Bloom filter is \((n,t,\varepsilon)\)-*adversarial resilient* if, when initialized over sets of size \(n\), after \(t\) queries the probability of \(x^*\) being a false positive is at most \(\varepsilon\). A Bloom filter that is resilient for any polynomial number of queries is said to be *strongly resilient*.

Clearly, a trivial construction of a strongly resilient Bloom filter would be a deterministic lookup table that stores \(S\) precisely, so that there is no false positive which an adversary can find. However, such a construction does not take advantage of the space and time efficiency as a normal Bloom filter would do, since it stores every element in the memory. In the following, we consider only non-trivial Bloom filters, and we show that for a non-trivial Bloom filter to be adversarial-resilient, one-way functions must exist; that is, if one-way functions do not exist, then any non-trivial Bloom filter can be attacked with a non-negligible probability by an efficient adversary. Furthermore, under the assumption that one-way functions exist, a pseudorandom permutation (PRP) can be used to construct a strongly resilient Bloom filter which has a reasonable memory consumption.
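As a rough illustration of the idea (not the paper’s exact construction), the probe positions can be derived from a keyed PRF such as HMAC-SHA256 standing in for the pseudorandom permutation, so that an adversary without the key cannot predict which bits a query touches:

```python
import hashlib
import hmac

SECRET_KEY = b"a-secret-key"  # hypothetical; in practice, freshly sampled
M, K = 1024, 7

def keyed_positions(item):
    # Derive probe positions from HMAC-SHA256 under a secret key
    # (a PRF stand-in for the pseudorandom permutation in the paper).
    out = []
    for i in range(K):
        tag = hmac.new(SECRET_KEY, f"{i}:{item}".encode(), hashlib.sha256).digest()
        out.append(int.from_bytes(tag[:8], "big") % M)
    return out

bits = bytearray(M)
for x in ["alice", "bob"]:
    for pos in keyed_positions(x):
        bits[pos] = 1

# Members always pass (completeness); without SECRET_KEY, an adversary
# cannot compute keyed_positions(x) and thus cannot adaptively search
# for false positives offline.
assert all(bits[p] for p in keyed_positions("alice"))
assert not all(bits[p] for p in keyed_positions("carol"))  # overwhelmingly likely
```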

The construction of a Bloom filter consists of two algorithms: an initialization algorithm that gets a set \(S\) and outputs a memory-efficient representation of \(S\); a query algorithm that gets a representation of \(S\) and an \(x\) to be checked and outputs \(1\) if \(x \in S\), otherwise \(0\). Typically, the initialization algorithm is randomized but the query algorithm is deterministic, that is, a query operation does not amend the existing representation. We say that such a Bloom filter has a *steady representation*.^{4}

In the following model, we consider a universal set \(U\) and a subset \(S \subset U\) to be stored in a Bloom filter. We denote that \(u=|U|\) and \(n=|S|\).

**Definition 1. (Steady-representation Bloom filter)** *Let \(\mathbf{B}=(\mathbf{B}_1,\mathbf{B}_2)\) be a pair of polynomial-time algorithms, where \(\mathbf{B}_1\) is a randomized algorithm that gets as input a set \(S\) and outputs a representation \(M\), and \(\mathbf{B}_2\) is a deterministic algorithm that gets as input a representation \(M\) and a query element \(x \in U\). We say that \(\mathbf{B}\) is an \((n,\varepsilon)\)-Bloom filter (with a steady representation) if for any set \(S \subset U\) of size \(n\) it holds that:*

- **(Completeness)** \(\forall x \in S, \Pr[\mathbf{B}_2(\mathbf{B}_1(S), x) = 1] = 1\)
- **(Soundness)** \(\forall x \not\in S, \Pr[\mathbf{B}_2(\mathbf{B}_1(S), x) = 1] \leq \varepsilon\)

*where the probability is taken over the randomness used by the algorithm \(\mathbf{B}_1\).*

Intuitively, the first property (completeness) says that for all elements in the set \(S\), the Bloom filter is guaranteed to output a *yes*-answer correctly; the second property (soundness) gives the upper bound that the Bloom filter outputs a false positive, that is, the query algorithm returns \(1\) when an element does not actually belong to the set \(S\). Formally,

**False positive and error rate.** Given a representation \(M\) of \(S\), if \(x \not\in S\) and \(\mathbf{B}_2(M,x)=1\), we say that \(x\) is a *false positive*. And we say that the probability bound \(\varepsilon\) of outputting false positives is the *error rate* of \(\mathbf{B}\).

In an adversarial environment, consider the following experiment for any Bloom filter \(\mathbf{B}=(\mathbf{B}_1,\mathbf{B}_2)\), adversary \(\mathcal{A}\), value \(t\) as the bound of the amount of queries which \(\mathcal{A}\) can make, and value \(\lambda\) as the security parameter.

**The Bloom filter resilience challenge experiment \(\mathsf{Challenge}_{\mathcal{A},\mathbf{B},t}(\lambda)\):**

- \(M \leftarrow \mathbf{B}_1(1^\lambda,S)\).^{5}
- \(x^* \leftarrow \mathcal{A}^{\mathbf{B}_2(M,\cdot)}(1^\lambda,S)\), where \(\mathcal{A}\) performs at most \(t\) queries \(x_1,\dots,x_t\) to the oracle \(\mathbf{B}_2(M,\cdot)\). Note that \(\mathcal{A}\) has only oracle access to the Bloom filter and cannot see the representation \(M\).^{6}
- The output of the experiment is defined to be \(1\) if \(x^* \not\in S \cup \{x_1,\dots,x_t\}\) and \(\mathbf{B}_2(M,x^*)=1\), and \(0\) otherwise. If the output of the experiment is \(1\), we say that \(\mathcal{A}\) succeeds.

**Definition 2. (Adversarial-resilient Bloom filter)** *Let \(\mathbf{B}=(\mathbf{B}_1,\mathbf{B}_2)\) be an \((n,\varepsilon)\)-Bloom filter (with a steady representation). We say that \(\mathbf{B}\) is an \((n,t,\varepsilon)\)-adversarial resilient Bloom filter if for any set \(S\) of size \(n\), for all sufficiently large \(\lambda \in \mathbb{N}\) and for all probabilistic polynomial-time adversaries \(\mathcal{A}\), it holds that* \[\Pr[\mathsf{Challenge}_{\mathcal{A},\mathbf{B},t}(\lambda) = 1] \leq \varepsilon\]

*where the probability is taken over the randomness used by the algorithm \(\mathbf{B}_1\) and \(\mathcal{A}\).*

To define the non-triviality of a Bloom filter formally, notice that it is always desirable to minimize the memory use of the Bloom filter. Let \(\mathbf{B}\) be an \((n,\varepsilon)\)-Bloom filter that uses \(m\) bits of memory. It is shown in [3] that \(m \geq n \log \frac{1}{\varepsilon}\). Thus, we define

**Definition 3. (Minimal error)** *Let \(\mathbf{B}\) be an \((n,\varepsilon)\)-Bloom filter. We say that \(\varepsilon_0 = 2^{-\frac{m}{n}}\) is the minimal error of \(\mathbf{B}\).*
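A quick numeric instance of Definition 3 (the parameters are arbitrary):

```python
import math

# For a filter storing n = 100 elements in m = 1024 bits of memory:
n, m = 100, 1024
eps0 = 2 ** (-m / n)        # minimal error ε0 = 2^(-m/n) ≈ 2^(-10.24)

# The lower bound m ≥ n log(1/ε) from [3] says no (n, ε)-Bloom filter
# fitting in m bits can achieve an error rate below ε0:
assert math.isclose(n * math.log2(1 / eps0), m)
```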

As mentioned previously, a trivial construction of a Bloom filter is a lookup table that stores \(S\) precisely, in which case the memory use is \(m=\log \binom{u}{n} \approx n \log(\frac{u}{n})\); by the bound \(m \geq n \log \frac{1}{\varepsilon}\), such a construction suffices whenever \(\varepsilon > \frac{n}{u}\). Otherwise, if \(\varepsilon \leq \frac{n}{u}\) and \(u\) is super-polynomial in \(n\), then \(\varepsilon\) is negligible in \(n\) and every polynomial-time adversary has only a negligible probability of finding any false positive, so for such \(S \subset U\) resilience holds trivially. Noticing that \(\varepsilon_0 \leq \varepsilon\), we then define

**Definition 4. (Non-trivial Bloom filter)** *Let \(\mathbf{B}\) be an \((n,\varepsilon)\)-Bloom filter and let \(\varepsilon_0\) be the minimal error of \(\mathbf{B}\). We say that \(\mathbf{B}\) is non-trivial if there exists a constant \(c \geq 1\) such that \(\varepsilon_0 > \max\{\frac{n}{u},\frac{1}{n^c}\}\).*

We now show that the existence of adversarial resilient Bloom filters depends on the existence of one-way functions, that is, if any non-trivial, strongly resilient Bloom filter exists, then one-way functions also exist.

**Theorem 5.** *Let \(\mathbf{B}=(\mathbf{B}_1,\mathbf{B}_2)\) be any non-trivial Bloom filter (with a steady representation) of \(n\) elements that uses \(m\) bits of memory, and let \(\varepsilon_0\) be the minimal error of \(\mathbf{B}\). If \(\mathbf{B}\) is \((n,t,\varepsilon)\)-adversarial resilient for \(t=\mathcal{O}(\frac{m}{\varepsilon_0^2})\), then one-way functions exist.*

**Proof.** First we assume that one-way functions do not exist, then we show that we can construct a polynomial-time adversary \(\mathcal{A}\) such that \[\Pr[\mathsf{Challenge}_{\mathcal{A},\mathbf{B},t}(\lambda) = 1] > \varepsilon\] given a fixed value \(\varepsilon\). That is, \(\mathbf{B}\) cannot be \((n,t,\varepsilon)\)-adversarial resilient.

Define the following function \(f\): \[f(S,r,x_1,\dots,x_t) = (x_1,\dots,x_t,\mathbf{B}_2(M,x_1),\dots,\mathbf{B}_2(M,x_t))\] where \(S\) is a set of size \(n\), \(r\) is the string of random bits used by \(\mathbf{B}_1\), \(M\) is the representation of \(S\) generated by \(\mathbf{B}_1\) using randomness \(r\), and \(t=\frac{200m}{\varepsilon_0}\). Clearly, \(f\) is polynomial-time computable.

Since \(f\) is not a one-way function (under the assumption that one-way functions do not exist), there is also an algorithm that can invert \(f\) efficiently. Thus we have,

**Claim 6.** *Assume that one-way functions do not exist. Then there exists a polynomial-time algorithm \(\mathcal{A}\) that inverts \(f\) with a failure probability of at most \(\frac{1}{100}\):* \[\Pr[f(\mathcal{A}(f(S,r,x_1,\dots,x_t))) \neq f(S,r,x_1,\dots,x_t)] < \frac{1}{100}\]

**Proof.** Because \(f\) is not a one-way function, there exists[4] an algorithm \(\mathcal{A}'\) such that \(\Pr[\mathsf{Invert}_{\mathcal{A}',f}(n) = 1] \geq \frac{1}{p(n)}\), where \(p(n)\) is polynomial in \(n\). Construct an algorithm \(\mathcal{A}\) that runs \(\mathcal{A}'\) independently \(\lceil\frac{\log 100}{\log(p(n)) - \log(p(n)-1)}\rceil\) times, so the total failure probability is \(\Pr[f(\mathcal{A}(f(S,r,x_1,\dots,x_t))) \neq f(S,r,x_1,\dots,x_t)] < \left(1-\frac{1}{p(n)}\right)^{\lceil\frac{\log 100}{\log(p(n)) - \log(p(n)-1)}\rceil} \leq \frac{1}{100}\)

■
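The repetition argument can be checked with a toy simulation. The weak inverter below is a stand-in that succeeds with probability \(\frac{1}{p}\); the choice \(p = 20\) is arbitrary.

```python
import math
import random

random.seed(0)
p = 20
# number of independent runs: ceil(log 100 / (log p - log(p-1)))
runs = math.ceil(math.log(100) / (math.log(p) - math.log(p - 1)))

def weak_inverter():
    return random.random() < 1 / p      # stand-in: succeeds w.p. 1/p

trials = 10_000
failures = sum(
    not any(weak_inverter() for _ in range(runs))
    for _ in range(trials)
)
fail_rate = failures / trials           # ≈ (1 - 1/p)^runs ≤ 1/100
```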

Using \(\mathcal{A}\), construct the following probabilistic polynomial-time algorithm \(\mathsf{Attack}\):

**The algorithm \(\mathsf{Attack}\).** \(\mathsf{Attack}\) is given oracle access to the query algorithm \(\mathbf{B}_2(M,\cdot)\), and gets \(1^\lambda\) as input.

1. For \(i \in \{1,\dots,t\}\), sample \(x_i \in U\) uniformly, and query \(y_i = \mathbf{B}_2(M,x_i)\).
2. Run \(\mathcal{A}\) (the inverter of \(f\)) and get \((S',r',x_1,\dots,x_t) \leftarrow \mathcal{A}(x_1,\dots,x_t,y_1,\dots,y_t)\).
3. Compute \(M' \overset{r'}{\leftarrow} \mathbf{B}_1(1^\lambda,S')\), using \(r'\) as the randomness bits in the initialization.
4. For \(k=1,\dots,\frac{100}{\varepsilon_0}\), do:
   - Sample \(x^* \in U\) uniformly.
   - If \(\mathbf{B}_2(M',x^*)=1\) and \(x^* \not\in \{x_1,\dots,x_t\}\), output \(x^*\) and HALT.
5. Sample \(x^* \in U\) uniformly, and output \(x^*\).
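With toy parameters, \(\mathsf{Attack}\) can be exercised end to end: brute force over all \((S', r')\) stands in for the inverter \(\mathcal{A}\), which is only guaranteed to exist efficiently under the assumption that one-way functions do not exist. The filter, universe, and hash below are all illustrative.

```python
import itertools
import random

U = list(range(16))            # tiny universe
n, m = 2, 8                    # |S| = 2, m = 8 bits of memory
R = list(range(8))             # possible randomness values for B1

def B1(S, r):
    # toy 1-hash Bloom filter: set one bit per element
    bits = 0
    for x in S:
        bits |= 1 << ((5 * x + r) % m)
    return (bits, r)           # the hash parameter is part of M

def B2(M, x):
    bits, r = M
    return (bits >> ((5 * x + r) % m)) & 1

random.seed(0)
S = {3, 11}
M = B1(S, random.choice(R))

# Step 1: query t uniform points through the oracle only.
t = 6
xs = [random.choice(U) for _ in range(t)]
ys = [B2(M, x) for x in xs]

# Steps 2-3: "invert f" by brute force over all (S', r') consistent
# with the observed answers, and rebuild M'.
M2 = next(
    B1(set(S2), r2)
    for S2 in itertools.combinations(U, n)
    for r2 in R
    if all(B2(B1(set(S2), r2), x) == y for x, y in zip(xs, ys))
)

# Step 4: a positive of M' outside the queried points is, with good
# probability, also a false positive of the real filter M.
candidates = [x for x in U if B2(M2, x) == 1 and x not in xs]
```

Brute force takes time exponential in the instance size, which is exactly the point: if \(f\) is one-way, no efficient inverter exists and the attack is blocked.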

**Claim 7.** *Assume that \(\mathcal{A}\) inverts \(f\) successfully. For any representation \(M\), the probability that there exists a representation \(M'\) satisfying \(\mathbf{B}_2(M,x_i)=\mathbf{B}_2(M',x_i)\) for all \(i \in \{1,\dots,t\}\) while \(\Pr[\mathbf{B}_2(M,x) \neq \mathbf{B}_2(M',x)] > \frac{\varepsilon_0}{100}\) is at most \(\frac{1}{100}\).*

**Proof.** Fix any \(M'\) with \(\Pr[\mathbf{B}_2(M,x) \neq \mathbf{B}_2(M',x)] > \frac{\varepsilon_0}{100}\). Since the \(x_i\) are sampled independently and uniformly, we get \[\Pr[\forall i \in \{1,\dots,t\} : \mathbf{B}_2(M,x_i) = \mathbf{B}_2(M',x_i)] \leq \left(1 - \frac{\varepsilon_0}{100}\right)^t\]

Since the Bloom filter uses \(m\) bits of memory, there are \(2^m\) possible representations as candidates for \(M'\). Thus, by union bound, \[\Pr[\exists M' \ \forall i \in \{1,\dots,t\} : \mathbf{B}_2(M,x_i) = \mathbf{B}_2(M',x_i)] \leq 2^m \left(1 - \frac{\varepsilon_0}{100}\right)^t \leq \frac{1}{100}\]

Since \(\mathcal{A}\) is assumed to invert \(f\) successfully, it must output a representation \(M'\) such that for \(i \in \{1,\dots,t\}\), \(\mathbf{B}_2(M,x_i)=\mathbf{B}_2(M',x_i)\). Therefore, the above bound holds.

■
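The final inequality can be verified numerically; the values of \(m\) and \(\varepsilon_0\) below are arbitrary.

```python
# Check the union bound 2^m (1 - ε0/100)^t ≤ 1/100 with t = 200 m / ε0.
# Since (1 - ε0/100)^t ≤ e^(-2m), the product is at most (2/e^2)^m,
# which vanishes quickly as m grows.
for m in (8, 64, 512):
    for eps0 in (0.5, 0.01):
        t = 200 * m / eps0
        bound = (2 ** m) * (1 - eps0 / 100) ** t
        assert bound < 1 / 100
```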

Define \(\mu(M)=\Pr_{x \in U}[\mathbf{B}_2(M,x)=1]\) as the positive rate of \(M\) over \(U\). We now show that, except with probability \(2^{-n}\) over the choice of \(S\), every representation \(M_r^S\) generated from \(S\) satisfies \(\mu(M_r^S) > \frac{\varepsilon_0}{8}\):

**Claim 8.** \(\Pr_S[\exists r : \mu(M_r^S) \leq \frac{\varepsilon_0}{8}] \leq 2^{-n}\).

**Proof.** Let \(\mathsf{BAD}\) be the set of all sets \(S\) such that there exists an \(r\) with \(\mu(M_r^S) \leq \frac{\varepsilon_0}{8}\). Given \(S \in \mathsf{BAD}\), let \(\hat{S}\) be the set of all elements \(x\) such that \(\mathbf{B}_2(M_r^S,x)=1\); then \(|\hat{S}| \leq \frac{\varepsilon_0}{8} \cdot u\). Notice that we can encode any \(S \in \mathsf{BAD}\) by giving the representation \(M_r^S\) together with the index of \(S\) among all subsets of \(\hat{S}\) of size \(n\), and the number of encoding bits must be at least \(\log|\mathsf{BAD}|\): \[\log|\mathsf{BAD}| \leq m + \log\binom{\frac{\varepsilon_0 u}{8}}{n}
\leq m + n \log\left(\frac{\varepsilon_0 u}{8}\right) - n \log n + 2n
\leq -n + \log\binom{u}{n}
\] thus \(|\mathsf{BAD}| \leq 2^{-n}\binom{u}{n}\). Since the number of sets \(S\) is \(\binom{u}{n}\), \(\Pr_S[\exists r : \mu(M_r^S) \leq \frac{\varepsilon_0}{8}] \leq 2^{-n}\).

■

**Claim 9.** *Assume that \(\Pr[\mathbf{B}_2(M,x) \neq \mathbf{B}_2(M',x)] \leq \frac{\varepsilon_0}{100}\) and that \(\mu(M) > \frac{\varepsilon_0}{8}\). The probability that \(\mathsf{Attack}\) does not halt on Step 4 is at most \(\frac{1}{100}\).*

**Proof.** It follows directly from the assumptions that \(\mu(M') > \frac{\varepsilon_0}{8} - \frac{\varepsilon_0}{100} > \frac{\varepsilon_0}{10}\). For convenience, let \(\mathcal{X} = \{x_1,\dots,x_t\}\) and \(\hat{S'} = \{x : \mathbf{B}_2(M',x)=1\}\). We have that

\[E[|\hat{S'} \cap \mathcal{X}|] = t \cdot \mu(M') > \frac{200m}{\varepsilon_0} \cdot \frac{\varepsilon_0}{10} = 20m\]

By Chernoff bound with a probability of at least \((1-e^{-\Omega(m)})\) we have that \(|\hat{S'} \cap \mathcal{X}| < 40m\), \[|\hat{S'} \backslash \mathcal{X}|=|\hat{S'}|-|\hat{S'} \cap \mathcal{X}| > |\hat{S'}| - 40m \geq \frac{\varepsilon_0 u}{10} - 40m\]

*Case 1.* \(u=n^d\) (\(d\) is a constant). We show that \(|\hat{S'} \backslash \mathcal{X}| \geq 1\), so that an exhaustive search over the universal set \(U\) is efficient and guaranteed to find an element \(x^*\) in \(\hat{S'} \backslash \mathcal{X}\). Let \(c\) be a constant such that \(\varepsilon_0 > \frac{1}{n^c}\), \(\varepsilon_0 < \frac{1}{n^{c-1}}\). We have \(\frac{u}{n} = n^{d-1} \geq \frac{1}{\varepsilon_0} > n^{c-1}\), then \(d-c>1\). Moreover, \(m \leq n \log\frac{u}{n} \leq nd \log n\), thus, \[|\hat{S'} \backslash \mathcal{X}| \geq \frac{\varepsilon_0 u}{10} - 40m
\geq \frac{n^{d-c}}{10} - 40nd \log n > 1\]

*Case 2.* \(u\) is super-polynomial in \(n\). We show that the density of \(\hat{S'} \backslash \mathcal{X}\) in \(U\) is large enough that sampling can find an \(x^*\), with only a small failure probability. Since \(\frac{\varepsilon_0}{20}\) is polynomial in \(\frac{1}{n}\) but \(\frac{40m}{u} \leq \frac{40n \log u}{u}\) is negligible, we get that \(\frac{\varepsilon_0}{20} > \frac{40m}{u}\). It follows from \(\frac{|\hat{S'} \backslash \mathcal{X}|}{u} > \frac{\varepsilon_0}{10} - \frac{40m}{u}\) that \(\frac{|\hat{S'} \backslash \mathcal{X}|}{u} > \frac{\varepsilon_0}{20}\). Thus, the probability of sampling \(x \not\in \hat{S'} \backslash \mathcal{X}\) in all \(k\) attempts is bounded by \[\left(1-\frac{\varepsilon_0}{20}\right)^k
= \left(1-\frac{\varepsilon_0}{20}\right)^\frac{100}{\varepsilon_0}
< \frac{1}{100}
\]

In both cases, the probability that \(\mathsf{Attack}\) fails to find an \(x^*\) and halt at Step 4 is less than \(\frac{1}{100}\).

■
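The sampling bound in Case 2 can also be checked numerically (the values of \(\varepsilon_0\) are arbitrary):

```python
import math

# For any ε0 in (0, 1], (1 - ε0/20)^(100/ε0) ≤ e^(-5) ≈ 0.0067 < 1/100.
for eps0 in (1.0, 0.1, 0.001):
    k = 100 / eps0
    assert (1 - eps0 / 20) ** k < 1 / 100
assert math.exp(-5) < 1 / 100
```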

**Claim 10.** \(\Pr[\mathbf{B}_2(M',x^*)=1 \wedge \mathbf{B}_2(M,x^*)=0] \leq \frac{1}{100}\)

**Proof.** This follows from the assumption that \(\Pr[\mathbf{B}_2(M,x) \neq \mathbf{B}_2(M',x)] \leq \frac{\varepsilon_0}{100}\).

■

**Claims 6, 7, 9 & 10** cover all the cases in which \(\mathsf{Attack}\) can fail, each bounding the failure probability of its case by \(\frac{1}{100}\) (by **Claim 8**, the remaining bad event occurs with probability at most \(2^{-n}\)). Taking a union bound, the total failure probability is at most \(\frac{4}{100}\) plus a negligible term. Thus, we have constructed a polynomial-time adversary \(\mathsf{Attack}\) such that \[\Pr[\mathsf{Challenge}_{\mathsf{Attack},\mathbf{B},t}(\lambda) = 1] \geq 1-\frac{4}{100}-2^{-n} > \varepsilon\] therefore \(\mathbf{B}\) cannot be \((n,t,\varepsilon)\)-adversarial resilient, a contradiction: under the assumption that one-way functions do not exist, such adversarial resilient Bloom filters do not exist. By contraposition, if non-trivial \((n,t,\varepsilon)\)-adversarial resilient Bloom filters exist, then one-way functions exist.

■

In **Theorem 5** we showed that the existence of adversarial resilient Bloom filters implies the existence of one-way functions. Conversely, if one-way functions exist, then pseudorandom permutations (which can be constructed from them) may be used to build non-trivial, strongly adversarial resilient Bloom filters. We have

**Proposition 11. (Construction using pseudorandom permutations)**^{7} *Let \(\mathbf{B}\) be an \((n,\varepsilon)\)-Bloom filter using \(m\) bits of memory. If pseudorandom permutations exist, then for any security parameter \(\lambda\), there exists an \((n,\varepsilon+\mathsf{negl}(\lambda))\)-strongly resilient Bloom filter that uses \(m'=m+\lambda\) bits of memory.*
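A sketch of the idea behind Proposition 11, with HMAC-SHA256 standing in for the pseudorandom permutation (HMAC gives a PRF, not a PRP, so this is an illustration only, not the construction proved in [5]): the filter stores \(F_K(x)\) instead of \(x\), and the \(\lambda\)-bit key \(K\) is the extra memory.

```python
import hashlib
import hmac
import os

def F(K, x):                       # keyed pseudorandom mapping (PRF stand-in)
    return hmac.new(K, str(x).encode(), hashlib.sha256).digest()

class ResilientBloom:
    def __init__(self, S, m=1024, k=4):
        self.K = os.urandom(16)    # λ = 128 extra bits of memory
        self.m, self.k = m, k
        self.bits = bytearray(m)
        for x in S:
            for i in self._idx(x):
                self.bits[i] = 1

    def _idx(self, x):
        d = F(self.K, x)           # adversary cannot predict these indices
        for j in range(self.k):
            yield int.from_bytes(d[4*j:4*j+4], "big") % self.m

    def query(self, x):
        return all(self.bits[i] for i in self._idx(x))

rb = ResilientBloom(range(100))
```

Because the hashing is keyed, the positions an element maps to look random to any adversary without \(K\), which is the intuition for why accumulated query answers stop helping.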

**Unsteady representation and computationally unbounded adversary.** The above discussion about Bloom filters assumes that steady representation is used, that is, the query algorithm \(\mathbf{B}_2\) is deterministic. Some implementations allow \(\mathbf{B}_2\) to change the internal representation thus querying can also be probabilistic. Further results regarding *unsteady representations* may be found in [5], with the ACD^{8} framework proposed in [6]. Moreover, some results are shown to hold even for *computationally unbounded adversaries*.

**Bloom filters and secrecy.** Bloom filters, like hash functions, are not designed as encryption schemes. Thus, even adversarial resilient Bloom filters may leak considerable information in an unintended way. As a concrete example, in Bitcoin lightweight SPV clients, which rely on Bloom filters to store users’ Bitcoin addresses, an adversary can efficiently extract information about these addresses. See [7] for a discussion of this interesting scenario.

[1] B. M. Maggs and R. K. Sitaraman, “Algorithmic nuggets in content delivery,” *ACM SIGCOMM Computer Communication Review*, vol. 45, no. 3, pp. 52–66, 2015.

[2] R. Benjamin and E. K. Yasmine, “Attacks on bitcoin,” 2015.

[3] L. Carter, R. Floyd, J. Gill, G. Markowsky, and M. Wegman, “Exact and approximate membership testers,” in *Proceedings of the Tenth Annual ACM Symposium on Theory of Computing*, 1978, pp. 59–65.

[4] J. Katz and Y. Lindell, “Introduction to modern cryptography,” 2014.

[5] M. Naor and E. Yogev, “Bloom filters in adversarial environments,” in *Annual Cryptology Conference*, 2015, pp. 565–584.

[6] M. Naor and G. N. Rothblum, “Learning to impersonate,” in *Proceedings of the 23rd International Conference on Machine Learning*, 2006, pp. 649–656.

[7] A. Gervais, S. Capkun, G. O. Karame, and D. Gruber, “On the privacy provisions of bloom filters in lightweight bitcoin clients,” in *Proceedings of the 30th Annual Computer Security Applications Conference*, 2014, pp. 326–335.

We will only consider polynomial-time adversaries here.↩

For this reason, a *yes*-answer is also referred to as a *maybe*-answer in some texts.↩

A CDN (Content Delivery Network) is a globally distributed network of proxy servers deployed in multiple data centers, in order to serve cached content to end-users with high availability and high performance.↩

In this work, we will consider only steady representations.↩

The security parameter \(\lambda\) is implied in the initialization algorithm \(\mathbf{B}_1\), thus we denote it as \(\mathbf{B}_1(1^\lambda,S)\).↩

To strengthen our definition, we assume that the adversary also gets the set \(S\), and it is important that the adversary can hardly find any false positive even if given \(S\).↩

A proof of this can be found in [5].↩

Learning model of adaptively changing distributions (ACD).↩

**(20 Nov 2016) Correction:** P ≠ NP is not sufficient to imply that one-way functions exist. See P versus NP problem and one-way functions.

**Intro.** Starting from November, I’ll summarize my study notes on wiki into weekly blog posts. I always wanted to keep my study progress on track; I feel that it’s hard even to convince myself without visible footprints.

So here we have the first episode. (Hopefully it won’t be the last one)

Asymptotic notation is an important tool in analyzing the time/space efficiency of algorithms.

- Limit
- Formal definition of limit (the (ε, δ)-definition) in calculus. Note that limits involving infinity are closely related to asymptotic analysis. In addition to basic limit rules, L’Hôpital’s rule is also relevant.

- Asymptotic notation
- Introduction to the Bachmann–Landau notation family (among them are the most widely-used Big O notation and Big Theta notation).
- Master theorem is used to find the asymptotic bound for recurrence. This is particularly helpful when analyzing recursive algorithms (e.g., binary search, merge sort, tree traversal).
- Based on common orders of asymptotic running time using Big O notation, we can categorize algorithms into various classes of time complexities (among them are P, DLOGTIME, SUBEXP and EXPTIME). Note that we have not formally defined the terms “algorithm” and “complexity class” yet.
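For instance, merge sort’s recurrence is resolved by case 2 of the master theorem: \[T(n) = 2\,T\!\left(\tfrac{n}{2}\right) + \Theta(n), \qquad a = 2,\ b = 2,\ f(n) = \Theta(n) = \Theta(n^{\log_b a}),\] hence \(T(n) = \Theta(n \log n)\).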

For decision problems, we now formally define the time complexity classes P and NP, and introduce the hardness of NP-complete problems, which plays an indispensable role in the study of algorithm design and modern cryptography.

- Formal language
- Formal definition of language. We will revisit this when studying formal grammars like the context-free grammar and parsing techniques for compilers. For now, it suffices to know that binary strings are a common encoding for all kinds of problems (especially, decision problems).

- Decidability
- Among all abstract problems, we are mostly interested in decision problems.
- The decidability of a language depends on whether there exists an algorithm that decides it.

- Reducibility
- Polynomial-time reduction is a commonly used technique that maps one language to another.
- What is a hard language for a complexity class; what is a complete language for a complexity class.

- Time complexity
- Encodings of concrete problems matter. Normally we would choose a “standard encoding” for our language of interest.
- Polynomial-time algorithms are considered to be efficient and languages which have polynomial-time algorithms that decide them are considered tractable.
- P is the time complexity class of all problems that are polynomial-time solvable.
- NP is the time complexity class of all problems that are polynomial-time verifiable.

- NP-completeness
- The set of languages that are complete for the complexity class NP, that is, the “hardest problems” in NP.
- NP-complete problems are central in answering the open question whether P = NP.
- We (informally) show that every NP problem is polynomial-time reducible to CIRCUIT-SAT, and that CIRCUIT-SAT is NP-complete.
- There are other problems (SAT, 3-CNF-SAT, CLIQUE, VERTEX-COVER, HAM-CYCLE, TSP, SUBSET-SUM) polynomial-time reducible from one to another, thus they are also shown to be NP-complete.

**Computational hardness assumption P ≠ NP.** Although it is still an open proposition, many believe that P ≠ NP. Notably, if P ≠ NP holds true,

- If a decision problem is polynomial-time unsolvable in general case, we should strive to find approximations or randomized algorithms; exact algorithms cannot be run in worst-case polynomial time thus may not be efficient. This applies to optimization problems too.
- ~~One-way functions exist, which implies that pseudorandom generators and functions exist. Consequently, many cryptographic constructions (private-key encryption, MACs, etc.) are provably computationally secure.~~

(I stand corrected: there is no known proof that P ≠ NP implies the existence of one-way functions. Conversely, however, the existence of one-way functions does imply that P ≠ NP. There is an informal argument given by Peter Shor on StackExchange^{1}, rephrased in the section P versus NP problem and one-way functions.)

Later we will cover the notion of security in cryptography, so there is a refresher of basic probability: (Probability is also considerably used in analyzing the behaviors of non-deterministic algorithms, hash functions, etc.)

- Probability
- An intuitive introduction to basic probability theory based on Kolmogorov’s axioms, including the union bound (Boole’s inequality) and its generalized form Bonferroni inequalities, the conditional probability and Bayes’ theorem. We will revisit the notion of probability space when coming to measure theory.

Plan for next week:

- **(Algorithms)** More involved NP-complete problems. Exact algorithms. Approximation algorithms. Probabilistic algorithms.
- **(Cryptography)** Information-theoretic/computational security (semantic security, IND, IND-CPA, IND-CCA). Private-key encryption. Message authentication codes. Hash functions. Theoretical constructions (one-way functions, pseudorandomness). Practical constructions (Feistel network, substitution-permutation network, DES, AES).

Consider the following map: \[f : (x, r) \mapsto s\] where \(x\) is an arbitrary bit string, \(r\) is a string of random bits, and \(s\) is an instance of a \(k\)-SAT problem having \(x\) as a planted solution; the randomness \(r\) uniquely determines which \(k\)-SAT instance is chosen.
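The map can be sketched concretely for \(k=3\). Note that the naive rejection sampling below (keeping only clauses that \(x\) satisfies) biases the clause distribution toward \(x\), which hints at why planted instances need not be hard; all parameters and the clause encoding are illustrative.

```python
import random

# Given a planted assignment x and random bits r, emit a 3-SAT formula
# that x satisfies. A clause is a list of (variable, negated?) pairs.
def planted_3sat(x, r, n_clauses):
    rng = random.Random(r)          # r determines which instance is chosen
    n, clauses = len(x), []
    while len(clauses) < n_clauses:
        vs = rng.sample(range(n), 3)
        clause = [(v, rng.random() < 0.5) for v in vs]
        if any(x[v] != neg for v, neg in clause):   # keep only if x satisfies it
            clauses.append(clause)
    return clauses

def satisfies(x, clauses):
    return all(any(x[v] != neg for v, neg in c) for c in clauses)

rng0 = random.Random(7)
x = [rng0.random() < 0.5 for _ in range(20)]   # the planted solution
s = planted_3sat(x, r=42, n_clauses=60)
assert satisfies(x, s)
```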

If we can invert the above function \(f\) (in polynomial time), we must already have solved the corresponding \(k\)-SAT problem \(s\) with the planted solution \(x\). \(k\)-SAT problems are known to be NP-complete, and inverting such a function would be as hard as solving a \(k\)-SAT problem with a planted solution; that is, inverting \(f\) *at one point* can be hard. Clearly, if we had a one-way function, inverting it would be guaranteed to be no easier than inverting \(f\).

So what does it mean if P ≠ NP? We know that \(k\)-SAT problem is hard to solve in its *worst case*, so function \(f\) can be made as hard to invert as solving a \(k\)-SAT problem in its *worst case*. However, we don’t know whether it’s possible to have a class \(\mathcal{S}\) of \(k\)-SAT problems with planted solutions that are as hard as general-case \(k\)-SAT problems. If such a class \(\mathcal{S}\) exists, then given any \(s \in \mathcal{S}\), no probabilistic polynomial-time algorithm is able to get \(x\) with a non-negligible probability, so we can conclude that \(f\) is indeed a one-way function. [1]

**Problem 1.1.** Does there exist a class \(\mathcal{S}\) of \(k\)-SAT problems with planted solutions, such that every \(L \in \mathcal{S}\) is NP-hard?

**Conjecture 1.2.** *If \(\mathrm{P} \neq \mathrm{NP}\), then one-way functions exist.*

On the other hand, assume that \(f\) is a one-way function, so that one-way functions do exist; then the \(k\)-SAT problem is hard to solve (in its worst case) by a polynomial-time algorithm, thus we have P ≠ NP. By contraposition, if P = NP, then no one-way function exists. [2]

**Theorem 1.3.** *If one-way functions exist, then \(\mathrm{P} \neq \mathrm{NP}\).*

*Proof.* *(Sketch)* Let \(f : \{0,1\}^*\to\{0,1\}^*\) be a one-way function. There is a polynomial-time algorithm \(M_f\) that computes \(y=f(x)\) for all \(x\); thus, there exists a polynomial-size circuit that outputs \(y=f(x)\) for all \(x\).

Since \(f\) is a one-way function, for every probabilistic polynomial-time algorithm \(\mathcal{A}\) there is a negligible function \(\mathsf{negl}\) such that \(\Pr[\mathsf{Invert}_{\mathcal{A},f}(n) = 1] \leq \mathsf{negl}(n)\), so no \(\mathcal{A}\) can fully compute \(f^{-1}\). But \(\mathcal{A}\) could fully compute \(f^{-1}\) if it could solve the corresponding `CIRCUIT-SAT` problems of the circuit in all cases. Thus, there must exist some `CIRCUIT-SAT` problems that cannot be decided by a polynomial-time algorithm; therefore, \(\mathrm{P} \neq \mathrm{NP}\).

■

*Remark 1.4.* If one can come up with a construction of the one-way function or a proof that such functions exist, then it holds true that \(\mathrm{P} \neq \mathrm{NP}\).

**Books:**

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, *Introduction to Algorithms*, 3rd ed.

M. Sipser, *Introduction to the Theory of Computation*, 3rd ed.

J. Katz and Y. Lindell, *Introduction to Modern Cryptography*, 2nd ed.

**Papers:**

[1] A. L. Selman, “A survey of one-way functions in complexity theory,” *Mathematical Systems Theory*, vol. 25, no. 3, pp. 203–221, 1992.

[2] M. Abadi, E. Allender, A. Broder, J. Feigenbaum, and L. A. Hemachandra, “On generating solved instances of computational problems,” in *Proceedings on advances in cryptology*, 1990, pp. 297–310.