File shares will not exist forever. Microsoft will eventually stop supporting them. It is important that IT departments begin reviewing the objects from the “bottom up” (file to folder to subdirectory to directory) yesterday. Information management professionals know best how to apply a set of internal (user-driven) and external (records retention schedule) criteria to determine which objects stay where they are, which objects move to other locations or custodians, and which objects are deleted.

CIOs, be wary: this is not a four-month project.

Certain tools will advance the tasks and schedule quickly, though. For example, have an abandoned records policy in place -- if file custodians are gone, IT is empowered to make decisions on objects. Create a records retention schedule in partnership with the Legal function.

A file share cleanup project is linear from statistical analysis through the deduplication phases. Post deduplication, the world becomes a bit more complicated -- satisfyingly so.

Statistical Analysis

Statistical analysis is my favorite part of the project. Look to populate the following metadata per object:

Must Have   Nice to Have
  • File Name
  • File Extension
  • File Type
  • Volume
  • Size
  • Creation Time
  • Owner User Name
  • Last Access Time
  • Last Modified Time
  • Days Since Last Access
  • Days Since Last Modify
  • Days Since File Creation
  • Days Since Creation
  • File Attributes
  • Read Only
  • Hidden
  • System Flag
  • Encrypted Flag
  • Not Content Indexed
  • Duplicate Key String
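
If no third-party tool is at hand, a few of the fields above can be gathered with a short script. The following is a minimal sketch, not a replacement for a proper survey tool. The function name `survey`, the dictionary keys, and the duplicate key format (lowercased name, size, and modified time) are assumptions made for illustration. Creation time and owner are left out because they are read differently on each platform.

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def survey(root, now=None):
    """Yield one metadata dict per file found under root."""
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.stat()
            except OSError:
                continue  # skip files that vanish or cannot be read
            yield {
                "file_name": name,
                "file_extension": path.suffix.lower(),
                "size_bytes": st.st_size,
                "last_access_time": st.st_atime,
                "last_modified_time": st.st_mtime,
                "days_since_last_access": int((now - st.st_atime) // SECONDS_PER_DAY),
                "days_since_last_modify": int((now - st.st_mtime) // SECONDS_PER_DAY),
                # Assumed key format: files sharing name, size and modified
                # time are only *candidates* for duplicates, not proven ones.
                "duplicate_key": f"{name.lower()}|{st.st_size}|{int(st.st_mtime)}",
            }
```

Writing the rows to a CSV file (for example with the standard `csv.DictWriter`) gives a baseline that can be compared against later surveys.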

Be prepared to write an index that maps the keywords in a records retention schedule to file extensions and file types. Third-party software will typically divide extensions into the following categories:

  • Miscellaneous Files (this is usually about 80 percent of the initial output because the tool isn’t industry-specific -- but you as administrator can sort the extensions in the tool so that the next time you run the report, they will more closely match the right category)
  • Container Files
  • Data Files
  • Text Files
  • Temporary and Backup Files
  • Graphic Files
  • System Files
  • PC Virtualization Files
  • Database Files
  • Office Files and Documents
  • Program Files
  • Internet Files
  • Software Development Files
  • Video Files
  • Configuration Files
  • Mail Files
  • Audio Files
  • Help Files

Category | Percent | Description
Miscellaneous Files | 79.00% | Unknown file types
Container Files | 10.00% | Compressed archives and disk images
Data Files | 4.60% | Files containing data of various kinds, not including files of databases
Text Files | 4.40% | Plain text files, log files
Temporary and Backup Files | 1.30% | Temporary files and backup copies containing previous versions of current files
Graphic Files | 0.20% | Files containing pictures, images or mouse cursors
System Files | 0.20% | System files
PC Virtualization Files | 0.20% | Files of Virtual PC, VMware, etc.
Database Files | 0.10% | Files containing the data of client and server databases
Office Files and Documents | 0.00% | Documents and files of office programs and PDFs
Program Files | 0.00% | Program files, libraries and other compiled resources
Internet Files | 0.00% | Files related to the WWW, like HTML files
Software Development Files | 0.00% | Source and project files of software development projects
Video Files | 0.00% | Files containing videos or animations
Configuration Files | 0.00% | Files containing configuration settings
Mail Files | 0.00% | Email messages and files of email clients
Audio Files | 0.00% | Files containing music, sounds or playlists
Help Files | 0.00% | Files of the Windows help system

A list of extensions sorted into the above categories can be kept in an Excel file -- a very helpful tool throughout the entire project.
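
The same extension index can also live in a small script, which makes it easy to re-run against each new survey. This is a minimal sketch under stated assumptions: the mapping below is a hypothetical, partial sample, the function names are invented for illustration, and the percentages are computed by file count, which may differ from how a given tool reports them.

```python
from collections import Counter

# Hypothetical, partial index; extend it as you sort extensions by category.
CATEGORY_BY_EXTENSION = {
    ".zip": "Container Files",
    ".iso": "Container Files",
    ".txt": "Text Files",
    ".log": "Text Files",
    ".tmp": "Temporary and Backup Files",
    ".bak": "Temporary and Backup Files",
    ".docx": "Office Files and Documents",
    ".pdf": "Office Files and Documents",
}


def categorize(extension):
    """Return the category for an extension; unknown ones are Miscellaneous."""
    return CATEGORY_BY_EXTENSION.get(extension.lower(), "Miscellaneous Files")


def category_shares(extensions):
    """Return each category's share of the files, as a percentage by count."""
    counts = Counter(categorize(ext) for ext in extensions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: round(100 * n / total, 2) for cat, n in counts.items()}
```

As more extensions are added to the index, the Miscellaneous share should shrink from one run to the next.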

Use the baseline results from the file share surveys to calculate when the growth of information exploded and whether shorter, more aggressive retention periods may reduce storage costs, then plot the results as a graph.


Remember to make the graph easily decipherable -- simple is elegant.
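
The numbers behind such a graph can be sketched as follows. This is only an illustration: the function names are hypothetical, each record is assumed to be a (timestamp, size in bytes) pair taken from the survey, and a retention year is approximated as 365 days.

```python
from collections import defaultdict
from datetime import datetime, timezone

SECONDS_PER_YEAR = 365 * 86400  # approximation; ignores leap years


def bytes_by_year(records):
    """Total the bytes per calendar year (UTC) of each record's timestamp."""
    totals = defaultdict(int)
    for timestamp, size in records:
        year = datetime.fromtimestamp(timestamp, tz=timezone.utc).year
        totals[year] += size
    return dict(sorted(totals.items()))


def reclaimable_bytes(records, retention_years, now):
    """Total the bytes older than the retention period, measured from now."""
    cutoff = now - retention_years * SECONDS_PER_YEAR
    return sum(size for timestamp, size in records if timestamp < cutoff)
```

Running `reclaimable_bytes` with several candidate retention periods shows how much storage each one would free, which is the figure to weigh against storage costs.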

Survey the metadata of file shares at least once a month for three months to glean a general understanding of user behaviors before you launch conversations with content creators to de-duplicate. The “bottom up” approach may take a more scenic route in terms of project schedule, but the payoff with end users in the long run is tremendous.
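
Comparing one month's survey with the next is one simple way to see that behavior. The sketch below assumes each survey has been reduced to a dictionary that maps a file path to a (size, last modified time) pair. The function name and that structure are hypothetical.

```python
def compare_snapshots(previous, current):
    """Compare two surveys keyed by path; values are (size, mtime) pairs."""
    prev_paths, curr_paths = set(previous), set(current)
    return {
        "added": sorted(curr_paths - prev_paths),
        "removed": sorted(prev_paths - curr_paths),
        "modified": sorted(
            p for p in prev_paths & curr_paths if previous[p] != current[p]
        ),
    }
```

Folders where nothing is added or modified over the three months are natural candidates to raise first with content creators.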