There is a huge need for a standard corpus of high-quality, free-to-use demo data. When building search applications, for instance, getting your hands on actual data can be near impossible, forcing you to design for unrealistic situations and compromising the end result. Well-rounded demo data would help ensure you’re working towards the right target.
There is a second huge benefit in having a de facto corpus: benchmarking. Different retrieval platforms could be compared against a fixed standard, and academic studies would benefit from greater comparability.
What would this corpus look like? Above all, it must be open source. It should contain a critical mass of records (10,000 or more), while still having a light footprint (under 100 megabytes). It should be captured in a common standard (such as XML) which could easily be ingested by a range of technologies. Beyond that, the data should meet at least the following requirements:
- Descriptive titles. Records should have properly curated titles.
- Free text. There should be large chunks of text.
- Ranges. There should be numerical, range-based data such as rating scores.
- Temporal. There should be dates and times.
- Spatial. Records should be geo-tagged or include addresses.
- Multi-value fields. Fields with more than one value, such as tags or keywords.
- Hierarchical. At least some of the records should belong to a taxonomy.
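To make the list above concrete, here is a rough sketch of what a single record might look like in XML, plus a small Python check that reports which of the requirements it covers. The element names (`record`, `body`, `rating`, `created`, `location`, `tags`, `category`) and the sample values are made up for illustration. They are assumptions, not a proposed standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout: every element and attribute name here is an
# illustrative assumption, not part of any existing corpus.
SAMPLE_RECORD = """
<record>
  <title>Harbourside Fish Market</title>
  <body>A long free-text description of the market and its stalls.</body>
  <rating>4.5</rating>
  <created>2012-03-14T09:30:00Z</created>
  <location lat="51.5072" lon="-0.1276"/>
  <tags>
    <tag>seafood</tag>
    <tag>market</tag>
  </tags>
  <category path="Food/Markets/Seafood"/>
</record>
"""


def _has_text(record, tag):
    """True if the child element exists and holds non-blank text."""
    return bool((record.findtext(tag) or "").strip())


def _is_number(text):
    """True if the text parses as a number (None and junk return False)."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def features_present(record):
    """Return the set of listed requirements that this record covers."""
    category = record.find("category")
    path = category.get("path", "") if category is not None else ""
    checks = {
        "title": _has_text(record, "title"),
        "free text": _has_text(record, "body"),
        "range": _is_number(record.findtext("rating")),
        "temporal": _has_text(record, "created"),
        "spatial": record.find("location") is not None,
        # "Multi-value" means more than one value in the same field.
        "multi-value": len(record.findall("tags/tag")) > 1,
        # A slash-separated path stands in for a position in a taxonomy.
        "hierarchical": "/" in path,
    }
    return {name for name, ok in checks.items() if ok}


record = ET.fromstring(SAMPLE_RECORD.strip())
print(sorted(features_present(record)))
```

Running the same check over every record in a candidate corpus would give a quick picture of how well it covers the list. Since only some records need to belong to a taxonomy, a missing `hierarchical` entry on an individual record would not be a failure in itself.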
Nice-to-haves
- Public URL. It would be a bonus if records contained a publicly-resolvable URL.
- Public multimedia. It would also be a bonus if records contained references to publicly-available images or video.
What do you think?
Most importantly, what do you think? Have I carelessly overlooked an excellent corpus already in existence? If not, then how would you improve this list of requirements? Together, let's make something.