
Reuters Corpus

If you're developing an information system and need a large corpus of sample data to test it on, Reuters makes large quantities of its news stories available for testing and development purposes. The corpus currently available covers a one-year period of news marked up in XML and requires about 2.5 GB of storage. More information on this service:

bq. As a service to the research community Reuters is making available large quantities of Reuters News stories for use in research and development of natural-language-processing, information-retrieval or machine learning systems.

bq. We believe the corpus released through this site to be superior in quality and size to previously available corpus of Reuters News stories such as the Reuters 21578 corpus, which has previously been seen as a standard real-world benchmarking corpus for the IR/IE etc community. The Reuters corpus released through this site is marked up in XML which we believe will significantly aid processing.