* We have a few functions that read a netint, stash it somewhere, and
  then move into a different state. Is it worth writing a generic
  helper for that, or would it be too confusing?

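  A minimal sketch of one possible shape, assuming rs_suck_netint()
  and the job->statefn member roughly as they appear in netint.h and
  job.h; the generic helper itself is hypothetical:

      #include "job.h"      /* rs_job_t, statefn */
      #include "netint.h"   /* rs_suck_netint() */

      typedef rs_result rs_statefn_t(rs_job_t *);

      /* Read LEN bytes of netint into *WHERE, then switch to
       * NEXT_STATE; returns RS_BLOCKED until enough input is available. */
      static rs_result rs_read_netint_then(rs_job_t *job, int len,
                                           rs_long_t *where,
                                           rs_statefn_t *next_state)
      {
          rs_result result = rs_suck_netint(job, where, len);
          if (result == RS_DONE)
              job->statefn = next_state;
          return result;
      }
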
* Duplicate block handling. Currently duplicate blocks are included in
  the signature, but we only put the first duplicate block in the
  hashtable, so the delta only includes references to the first block.
  This can result in sub-optimal copy commands, breaking a single large
  copy containing duplicate blocks into multiple copies that reference
  the earlier occurrence of the block. On the other hand, it may also
  make patching use the disk cache more effectively. This behaviour is
  probably fine, particularly given how small copy instructions are,
  but there might be ways to improve the copy commands emitted for long
  runs of duplicate blocks.

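  In other words, loading the signature hashtable is insert-if-absent.
  A sketch, with illustrative stand-ins for librsync's hashtable
  interface and block entries:

      #include <stddef.h>

      typedef struct hashtable hashtable_t;            /* stand-in */
      void *hashtable_find(hashtable_t *t, void *key);
      void *hashtable_add(hashtable_t *t, void *entry);

      /* Insert-if-absent: only the first block with a given sum is
       * kept, so every COPY in the delta refers back to that first
       * occurrence. */
      static void add_block(hashtable_t *t, void *block)
      {
          if (hashtable_find(t, block) == NULL)
              hashtable_add(t, block);
      }
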
* Optimisations and code cleanups:

  * scoop.c: Scoop needs a major refactor. Perhaps the API needs
    tweaking?

  * rsync.h: Should rs_buffers_s and rs_buffers_t be a single typedef?

  * Just how useful is rs_job_drive anyway?

  * mdfour.c: This code has a different API from the RSA code in libmd
    and is coupled with librsync in unhealthy ways (trace?). Recommend
    changing to the RSA API?

* Don't use the rs_buffers_t structure.

  There's something confusing about the existence of this structure.
  In part it may be the name. I think people expect it to be something
  that behaves like a FILE* or a C++ stream, and it really does not.
  Also, the structure does not behave as an object: it's really just
  shorthand for passing values in to the encoding routines, and so it
  does not have much identity of its own.

  An alternative might be

      result = rs_job_iter(job,
                           in_buf, &in_len, in_is_ending,
                           out_buf, &out_len);

  where the length parameters are updated on return to show how much
  was really consumed and produced.

  One technicality here will be restructuring the code so that the
  input buffers are passed down to the scoop/tube functions that need
  them, which are relatively deeply embedded. I guess we could just
  stick them into the job structure, which is becoming a kind of
  catch-all "environment" for poor C programmers.

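  A sketch of driving that proposed interface (the real rs_job_iter()
  takes a rs_buffers_t; this variant and fill_input()/drain_output()
  are hypothetical):

      char in_buf[4096], out_buf[4096];
      size_t in_len, out_len;
      rs_result result;

      do {
          in_len = fill_input(in_buf, sizeof in_buf);   /* hypothetical */
          out_len = sizeof out_buf;
          result = rs_job_iter(job, in_buf, &in_len, in_len == 0,
                               out_buf, &out_len);
          /* On return, in_len and out_len say how much was actually
           * consumed and produced; unconsumed input must be offered
           * again on the next call. */
          drain_output(out_buf, out_len);               /* hypothetical */
      } while (result == RS_BLOCKED);
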
* Meta-programming

  * Plot the length of each function.

  * Some kind of statistics on the deltas produced each day.

* Encoding format

  * Include a version in the signature and difference fields.

  * Remember to update them if we ever ship a buggy version (nah!) so
    that other parties can know not to trust the encoded data (a
    sketch of such a check follows).

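  A sketch, assuming the magic words from librsync.h double as version
  numbers; the trust check itself is hypothetical:

      #include <stdint.h>
      #include "librsync.h"   /* RS_MD4_SIG_MAGIC, RS_BLAKE2_SIG_MAGIC */

      /* Refuse encodings whose magic/version is not one we trust; a
       * blacklist of magics shipped by buggy releases would go here. */
      static int version_is_trusted(uint32_t magic)
      {
          return magic == RS_MD4_SIG_MAGIC || magic == RS_BLAKE2_SIG_MAGIC;
      }
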
* Abstract encoding

  In fact, we can vary along several different axes:

  * What signature format are we using?

  * What command protocol are we using?

  * What search algorithm are we using?

  * What implementation version are we?

  Some are more likely to change than others. We need a chart
  showing which source files depend on which variable.

* Encoding implementation

  * Join up signature commands.

* Encoding algorithm

  * Self-referential copy commands

    Suppose we have a file with repeating blocks. The gdiff format
    allows COPY commands to extend into the *output* file, so that
    they can easily point this out. By doing this, they get
    compression as well as differencing.

    It'd be pretty simple to implement this, I think: as we produce
    output, we'd also generate checksums (using the search block
    size) and add them to the sum set. Matches would then fall out
    automatically, although we might have to make special allowance
    for short blocks.

    However, I don't see many files which have repeated 1kB chunks,
    so I don't know if it would be worthwhile.

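    A sketch of the idea, with illustrative stand-ins for the weak
    sum and sum set:

        #include <stddef.h>
        #include <stdint.h>

        typedef struct sum_set sum_set_t;              /* stand-in */
        uint32_t calc_weak_sum(const void *buf, size_t len);
        void sum_set_add(sum_set_t *sums, uint32_t weak, size_t offset);

        /* After emitting each search-block-sized chunk of output, add
         * its checksum to the sum set, so later identical output can
         * be encoded as a COPY into the *output* file, gdiff-style. */
        static void note_output_block(sum_set_t *sums, const char *out,
                                      size_t out_pos, size_t block_len)
        {
            uint32_t weak = calc_weak_sum(out + out_pos, block_len);
            sum_set_add(sums, weak, out_pos);
        }
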
  * Extended files

    Suppose the new file just has data added to the end. At the
    moment, we'll match everything but the last block of the old
    file. It won't match, because at the moment the search block
    size is only reduced at the end of the *new* file. This is a
    little inefficient, because ideally we'd know to look for the
    last block using the shortened length.

    This is a little hard to implement, though perhaps not
    impossible. The current rolling search algorithm can only look
    for one block size at any time. Can we do better? Can we look
    for all block lengths that could match anything?

    Remember also that at the moment we don't send the block length
    in the signature; it's implied by the length of the new block
    that it matches. This is kind of cute, and importantly helps
    reduce the length of the signature.

  * State-machine searching

    Building a state machine from a regular expression is a brilliant
    idea. (I think *The Practice of Programming* walks through the
    construction of this at a fairly simple level.)

    In particular, we can search for any of a large number of
    alternatives in a very efficient way, with much less effort than
    it would take to search for each one the hard way. Remember also
    the string-searching algorithms and how much time they can take.

    I wonder if we can use similar principles here rather than the
    current simple rolling-sum mechanism? Could it let us match
    variable-length signatures?

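    For contrast, the rolling-sum mechanism in question looks roughly
    like this (a minimal rsync-style sketch, not librsync's exact
    code):

        #include <stddef.h>
        #include <stdint.h>

        typedef struct {
            uint32_t s1, s2;   /* sum of bytes; sum of running sums */
            size_t len;        /* window length */
        } rollsum_t;

        static void rollsum_init(rollsum_t *r, const unsigned char *p,
                                 size_t len)
        {
            r->s1 = r->s2 = 0;
            r->len = len;
            for (size_t i = 0; i < len; i++) {
                r->s1 += p[i];
                r->s2 += r->s1;
            }
        }

        /* Slide the window one byte, dropping OUT from the front and
         * adding IN at the back, in O(1) rather than re-summing. */
        static void rollsum_rotate(rollsum_t *r, unsigned char out,
                                   unsigned char in)
        {
            r->s1 += in - out;
            r->s2 += r->s1 - (uint32_t)(r->len * out);
        }

        static uint32_t rollsum_digest(const rollsum_t *r)
        {
            return (r->s2 << 16) | (r->s1 & 0xffff);
        }
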
* Support gzip compression of the difference stream. Does this belong
  here, or should it be in the client, with librsync just providing an
  interface that lets it cleanly plug in?

  I think if we're going to do plain gzip, rather than rsync-gzip,
  then it might as well be external.

* rsync-gzip: preload with the omitted text so as to get better
  compression. Abo thinks this gets significantly better compression.
  On the other hand, we would have to import and maintain our own zlib
  fork, at least until we can persuade upstream to take the necessary
  patch. Can that be done?

  abo says

    It does get better compression, but at a price. I actually
    think that getting the code to a point where a feature like
    this can be easily added or removed is more important than the
    feature itself. Having generic pre- and post-processing layers
    for hit/miss data would be useful. I would not like to see it
    added at all if it tangled and complicated the code.

    It also doesn't require a modified zlib... pysync uses the
    standard zlib to do it by compressing the data, then throwing
    it away. I don't know how much benefit the rsync modifications
    to zlib actually bring, but if I were implementing it I would
    stick with stock zlib until the fork proved significantly
    better.

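  With stock zlib, the trick is to feed the matched text through the
  sender's deflate stream and throw the compressed bytes away, so the
  text still enters the compressor's history window; the decoder has
  to mirror this somehow to stay in sync. A rough sender-side sketch:

      #include <zlib.h>

      /* Feed HIT data (text the receiver already has) through the
       * deflate stream, discarding the output, so that later misses
       * can be compressed against it. */
      static int preload_window(z_stream *strm, const unsigned char *hit,
                                unsigned int hit_len)
      {
          unsigned char scratch[4096];   /* compressed bytes, discarded */
          strm->next_in = (Bytef *)hit;
          strm->avail_in = hit_len;
          while (strm->avail_in > 0) {
              strm->next_out = scratch;
              strm->avail_out = sizeof scratch;
              if (deflate(strm, Z_NO_FLUSH) != Z_OK)
                  return -1;
          }
          return 0;
      }
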
* Licensing

  Will the GNU Lesser GPL work? Specifically, will it be a problem
  in distributing this with Mozilla or Apache?

* Checksums

  * Do we really need to require that signatures arrive after the
    data they describe? Does it make sense in HTTP to resume an
    interrupted transfer?

    I hope we can do this. If we can't, however, then we should
    relax this constraint and allow signatures to arrive before the
    data they describe. (Really? Do we care?)

  * Allow variable-length checksums in the signature; the signature
    will have to describe the length of the sums, and we must compare
    them taking this into account.

* Testing

  * Just more testing in general.

  * Test broken pipes and that IO errors are handled properly.

  * Test files >2GB, >4GB. Presumably these must be done in streams
    so that the disk requirements to run the test suite are not too
    ridiculous. I wonder if it will take too long to run these
    tests? Probably, but perhaps we can afford to run just one
    carefully-chosen test.

  * Fuzz instruction streams. <https://code.google.com/p/american-fuzzy-lop/>?

  * Generate random data; do random mutations.

  * Try different block lengths.

  * Tests should fail if they can't find their inputs, or have zero
    inputs: at present they tend to succeed by default.

  * Test varying strong-sum inputs: default, short, long.

* Security audit

  * If this code were to read differences or sums from random machines
    on the network, then it would be a security boundary. Make sure
    that corrupt input data can't make the program crash or misbehave.

* Long files

  * How do we handle the large signatures required to support large
    files? In particular, how do we choose an appropriate block size
    when the length is unknown? Perhaps we should allow a way for
    the signature to scale up as the file grows.

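    One plausible heuristic, as rsync itself uses, is to scale the
    block size with the square root of the file length, so signature
    size grows as sqrt(n); the constants below are illustrative:

        #include <math.h>
        #include <stddef.h>

        /* Pick a block length ~ sqrt(file_len), clamped below and
         * rounded up to a 256-byte multiple. */
        static size_t pick_block_len(double file_len)
        {
            size_t b = (size_t)sqrt(file_len);
            if (b < 256)
                b = 256;
            return (b + 255) & ~(size_t)255;
        }
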
* Perhaps make extracted signatures still be wrapped in commands.
  What would this lead to?

  * We'd know how much signature data we expect to read, rather than
    requiring it to be terminated by the caller.