I would definitely recommend always using UTF-8 (or another Unicode encoding), since it can also represent characters that are missing from other character sets. Sooner or later you will run into "mostly Latin, but with a short Arabic sentence" or something similar.
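To make the "mostly Latin with a bit of Arabic" case concrete, here is a minimal sketch using only the Java standard library. The class and method names (`EncodingRoundTrip`, `roundTrip`, `survives`) are made up for illustration. It encodes a mixed string and decodes it again, once with UTF-8 and once with ISO-8859-1:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class EncodingRoundTrip {

    // Encode the text with the given charset, then decode the bytes again.
    // String.getBytes(Charset) silently replaces characters the charset
    // cannot represent with that charset's replacement byte.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // True if the text comes back unchanged after the round trip.
    static boolean survives(String text, Charset charset) {
        return text.equals(roundTrip(text, charset));
    }

    public static void main(String[] args) {
        // "Hello, " followed by five Arabic letters, written as escapes
        // so the source file's own encoding does not matter.
        String mixed = "Hello, \u0645\u0631\u062D\u0628\u0627";

        System.out.println(survives(mixed, StandardCharsets.UTF_8));      // true
        System.out.println(survives(mixed, StandardCharsets.ISO_8859_1)); // false
    }
}
```

With ISO-8859-1 the Arabic letters are lost during encoding, and nothing warns you about it unless you compare the result with the original.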
However, the world is not perfect, and I have been in a similar situation myself:
- Apache Tika can do a lot of things, including charset detection: CharsetDetector (Apache Tika 1.3 API)
- Based on my own experience, I would recommend: Google Code Archive - Long-term storage for Google Code Project Hosting.
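Before handing bytes to a detection library, a cheap first check is whether they are already valid UTF-8. The sketch below is not what either of the libraries above does internally; it is a standard-library-only illustration, and the names `Utf8Check` and `isValidUtf8` are hypothetical. It relies on a strict `CharsetDecoder` that reports malformed input instead of silently replacing it:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8Check {

    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder hit a byte sequence that is not legal UTF-8.
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "h\u00E9llo"; // "hello" with an e-acute

        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(asUtf8));   // true
        System.out.println(isValidUtf8(asLatin1)); // false
    }
}
```

A `true` result only says the bytes are well-formed UTF-8; pure ASCII input passes as well, so this narrows the question down rather than answering it. When the check fails, a statistical detector such as the ones mentioned above is the next step.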