Once again I apologize for neglecting my email -- I'm spending most of
my time finding and downloading data from the web. The good news is
that there is a huge ocean of available data out there, much more than
I expected. The bad news is that the people who gathered it have used
a wide range of data collection strategies and file formats, so it
takes a lot more work to do anything with it than I expected. Even
just verifying that there is useful data in the downloaded files can
be quite difficult!
Along with freely available and redistributable simulation software, I
think I'll try to come up with some non-proprietary file format standards
and software to convert existing codebook and dataset files.
I'm downloading a lot of educational and social survey data as well as
purely economic stuff. As I see it, the economy depends on many
factors that are usually ignored by economists, and I don't want to
ignore any data that might help make the simulation accurate.
If you have a favourite social or economic factor, please let me know,
and I'll try to find some data for it. I've already noted Jay Hanson's
concern for the energy-cost of energy, Tom Walker's concern about
investments at compound interest, and Eva Durant's interest in other
forms of ownership. Whether I agree with these concerns is not
important; I will still try to find some data about them.
This simulation should be self-validating in two ways:
1. Complete cross-validation capabilities will be provided -- that
is to say, the whole simulation minus one individual datum can
be used to estimate that single number, and the result compared with
the actual number. If enough processing power is available, that
can be done with ALL numbers, which will not only verify the
simulation as a whole but help find erroneous data.
2. As time goes by, new data like this month's unemployment rate will
be added to tables that previously held figures only for earlier
months. As these are added, they can be compared with predictions
made last month, the month before that, and so on. The difference
between prediction and reality is often called a residual, since it
is what is left over after subtracting the prediction. The smaller
the residuals, the better the model.
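To make the two checks concrete, here is a minimal sketch in Python. It is not the simulation itself: the "model" below is just an average over the other data points, standing in for whatever the real simulation would compute, and the unemployment figures are made up for illustration.

```python
def loo_estimates(series):
    """Leave-one-out: estimate each datum from all the others.

    Here the stand-in model estimates datum i as the mean of the
    remaining n-1 values; the real simulation would be plugged in
    at this point.
    """
    n = len(series)
    total = sum(series)
    return [(total - x) / (n - 1) for x in series]

def residuals(predictions, actuals):
    """Residual = actual minus prediction, one per month."""
    return [a - p for p, a in zip(predictions, actuals)]

# Hypothetical monthly unemployment rates, purely for illustration.
unemployment = [5.2, 5.1, 5.3, 5.0, 5.4, 5.2]
estimates = loo_estimates(unemployment)

# Data points that disagree badly with the rest of the simulation
# are candidates for being erroneous (threshold is arbitrary here).
suspect = [i for i, (x, e) in enumerate(zip(unemployment, estimates))
           if abs(x - e) > 0.3]
```

Running the leave-one-out pass over every datum gives check 1; feeding each month's new figure to `residuals` against the predictions recorded in earlier months gives check 2.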
In general, more data is better than less, as long as it doesn't
contain too many errors, so I'll continue to collect data as
I find it.
I should point out that like Eva, Tom, and Jay, I do have an axe to
grind: my own pet theory, the combinatorial stuff I've bored you
with before. You may well worry about my own theories being treated
too benevolently in this simulation, but I will be making all
documentation, source code, and data public, which will let you snoop
around looking for bias to your heart's content.
I'm posting right after this a copy of a message I just sent to the
socialtechnology mailing list, which notes that survey data files
collected for this simulation will also be useful for my other
project, an attempt to do social matching and optimization. In fact
all my projects tend to blur into one another, and both the simulation
and matching projects will use software written earlier for semantic
network analysis (a species of linguistics).
I'll be putting some of that code up on the web as soon as I find
myself a new web host that supports FTP and CGI as well as ordinary
web pages -- your recommendations or warnings about web hosts would
be appreciated.
dpw
Douglas P. Wilson [EMAIL PROTECTED]
http://www.island.net/~dpwilson/index.html