Again, this is more of a question for a user@ list... Sorry.
I want to confirm I understand refetching correctly.
When the crawler goes to refetch a page, it adds the If-Modified-Since
header and, if an ETag exists, the If-None-Match header. If the host
respects those, it will return a 200 and new content if something has
changed; otherwise it will return a 304 (Not Modified) with no body.
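To make sure we're talking about the same thing, here is a rough sketch of
the conditional-request logic as I understand it. This is not the crawler's
actual code; the function names (conditional_headers, is_unchanged) are
made up for illustration, and I'm assuming the stored Last-Modified and
ETag values are kept as plain strings.

```python
from typing import Dict, Optional


def conditional_headers(last_modified: Optional[str],
                        etag: Optional[str]) -> Dict[str, str]:
    """Build the conditional headers for a refetch (hypothetical helper)."""
    headers: Dict[str, str] = {}
    if last_modified:
        # Ask the host to send a body only if the page changed since then.
        headers["If-Modified-Since"] = last_modified
    if etag:
        # Only sent when an ETag was stored from the earlier fetch.
        headers["If-None-Match"] = etag
    return headers


def is_unchanged(status_code: int) -> bool:
    """A host that honours the conditional headers answers 304 Not Modified."""
    return status_code == 304
```

A host that ignores these headers will simply answer 200 with the full body
every time, which is the case I'm asking about below.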
If the host doesn't respect those headers and returns exactly the same
bytes as were originally fetched with a 200, that content is returned
and written to a bolt.
In short, if we're writing to WARCs, and we refetch a page that
returns a 200 with exactly the same content as we originally fetched,
will we end up with two copies of the same content?
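For what it's worth, this is the kind of check I have in mind for spotting
that case. It is only a sketch under my own assumptions: content_digest and
is_duplicate_refetch are hypothetical names, and I'm assuming a SHA-1 digest
of the original body was stored at first fetch. I don't know whether the
crawler does anything like this today.

```python
import hashlib
from typing import Optional


def content_digest(body: bytes) -> str:
    """SHA-1 hex digest of the raw response body."""
    return hashlib.sha1(body).hexdigest()


def is_duplicate_refetch(previous_digest: Optional[str],
                         status_code: int,
                         body: bytes) -> bool:
    """True when a refetch returned 200 with bytes identical to the last fetch."""
    if status_code != 200 or previous_digest is None:
        return False
    return content_digest(body) == previous_digest
```

If nothing like this runs before the WARC writer, then I'd expect the
identical body to be written a second time.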
Thank you!
Best,
Tim