tomas wrote:

>>>> I mean a general tool, but with options to tweak the
>>>> report included, of course.
>>>
>>> If you can bear some tweaking, R is it.
>>
>> Sure! Let's run R on this e-mail. Does it work and if so, what
>> does it say?
>
> To a generic question -- a generic answer

R is a programming language; I'm looking for a tool that produces
stats from text. Whether such a tool uses R, or any other
programming language or stats engine, to produce the outcome is,
for me as a potential user, entirely up to those who write it.

> I don't even know what you mean by "general stats"

Some examples of stats on text are: average word length, most
commonly used words, the longest paragraphs ...

Those are simple examples. At the next step it gets more
interesting, as the tool could show what is statistically unusual:
fun/exotic stats that a human user would probably not spot. E.g.,
parsing this mail, it could say "Emanuel Berg is almost always calm
and collected, entirely professional in his approach, but here in
the 4th paragraph of his mail he gets VISIBLY UPSET using CAPS
ONLY, possibly expressing FRUSTRATION about NOT BEING UNDERSTOOD."

> the sports example you put in the other mail suggests that
> you want statistics gathered about a subject from written
> text

In the sports world they input the stats manually, and that data is
then crunched by computers to produce lists and neat graphics for
their broadcasts. This is the first step described above. It isn't
unlike, for example, Emacs `count-words-region' in combination with
gnuplot. Indeed, it is almost exactly the same: these chars I type
now are produced manually, then Emacs could count and gnuplot could
show. This first step would be neat depending on how much stuff is
quantified; the more the better, obviously.

The second step, however, would be those "fun facts" the
commentators say. These are more advanced, like, and now I'm just
making something up, "Here is an amazing figure. Player X has the
worst stats on face-offs in his team, except when the team plays on
its home field and is down by two or more goals, then he is 2nd
best". That second step, applied to text, would of course be even
more exciting.

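To make the two steps a bit more concrete, here is a minimal sketch
in Python. It is not an existing tool, just an illustration: the
function names (`step1_stats', `unusual_caps') are made up, the
word pattern only covers ASCII letters, and the threshold of 1.5
standard deviations is an arbitrary assumption. "Step 1" computes
the simple stats; "step 2" is reduced here to one example, flagging
paragraphs whose share of ALL-CAPS words is unusual compared to the
rest of the text.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def split_paragraphs(text):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def words(text):
    """Crude tokenizer (assumption): runs of ASCII letters/apostrophes."""
    return re.findall(r"[A-Za-z']+", text)


def step1_stats(text, top=3):
    """'Step 1': plain descriptive stats for a text."""
    ws = words(text)
    paras = split_paragraphs(text)
    longest = None
    if paras:
        # Index of the paragraph with the most words.
        longest = max(range(len(paras)), key=lambda i: len(words(paras[i])))
    return {
        "word_count": len(ws),
        "avg_word_length": mean(len(w) for w in ws) if ws else 0.0,
        "most_common": Counter(w.lower() for w in ws).most_common(top),
        "longest_paragraph": longest,
    }


def caps_ratio(paragraph):
    """Share of words (longer than one letter) written in all caps."""
    ws = [w for w in words(paragraph) if len(w) > 1]
    if not ws:
        return 0.0
    return sum(w.isupper() for w in ws) / len(ws)


def unusual_caps(text, threshold=1.5):
    """'Step 2' toy example: indices of paragraphs whose caps ratio is
    more than `threshold` standard deviations above the text's mean."""
    ratios = [caps_ratio(p) for p in split_paragraphs(text)]
    if len(ratios) < 2:
        return []
    mu, sigma = mean(ratios), pstdev(ratios)
    if sigma == 0:
        return []  # every paragraph looks the same, nothing stands out
    return [i for i, r in enumerate(ratios) if (r - mu) / sigma > threshold]
```

Feeding it the body of a mail as one string, `step1_stats(body)'
gives the counts and `unusual_caps(body)' gives the 0-based numbers
of the paragraphs that stand out. A real tool would of course
compare many more stats than the caps ratio, and against more than
one text.
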
I don't know if those crazy stats are discovered by a bunch of
fanatic hockey nerds just using the "step 1 stats" in creative
combinations - maybe using some sort of relational algebra
approach? - _or_ if they have some stats engine that crunches the
stats further to the meta-stats level, if you will, to have the
weird facts pop up automatically.

But yeah, if we don't even have a proper "step 1 stats" tool for
text - which is impossible to believe BTW - well, obviously one can
only dream of a "step 2 stats", a stats tool on the meta level ...

> involves "understanding texts written in human languages",
> another big can of worms (which has become somewhat
> fashionable as of late).

It is not about understanding, it is about finding patterns and
meta-patterns, finding statistics that are themselves statistically
uncommon, which is why they are interesting. Think exceptions and
unexpected interrelations. Again, the best example is probably a
combination of the different stats available at the "step 1 stats"
level.

> If it's text statistics, good statistics packages have lots
> of resources. R is a good statistics package

Yeah, maybe I should ask them, but as Debian is such a huge system
one would think someone here could show us how it, or similar
software, can be used on a bunch of text, for example on a mail
like this. It is already a bunch of data; surely you are not saying
there isn't a tool to tell us something about that data?

-- 
underground experts united
https://dataswamp.org/~incal