GPFS Scans 10 Billion Files in 43 Minutes

Using a small cluster of ten IBM xSeries® servers, IBM's cluster file system (GPFS™), and a new solid-state storage appliance from Violin Memory to hold the file system metadata, IBM Research demonstrated, for the first time, policy-guided storage management (daily tasks such as selecting files for backup, migration, etc.) for a 10-billion-file environment in 43 minutes. This new record shatters the previous record, also set by GPFS in 2007, by a factor of 37.
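To make "policy-guided storage management" concrete, the sketch below shows roughly how such a task is expressed in practice: a rule written in the GPFS policy language selects candidate files, and the mmapplypolicy command scans the file system metadata and applies the rule across the cluster. This is a minimal illustration, not the configuration used in the demonstration; the file system name, pool names, node list, and age threshold are hypothetical.

```python
import subprocess
import tempfile

# A hypothetical GPFS ILM rule: migrate files untouched for 30 days
# from an assumed 'system' pool to an assumed 'archive' pool.
# Rule syntax follows the GPFS policy language; details vary by release.
POLICY = """
RULE 'age_out' MIGRATE FROM POOL 'system' TO POOL 'archive'
  WHERE (CURRENT_TIMESTAMP - ACCESS_TIME) > INTERVAL '30' DAYS
"""

def run_policy_scan(filesystem: str) -> None:
    """Write the rule to a temporary file and let mmapplypolicy
    scan the metadata and act on matching files."""
    with tempfile.NamedTemporaryFile("w", suffix=".pol", delete=False) as f:
        f.write(POLICY)
        policy_file = f.name
    # mmapplypolicy walks the metadata in parallel across the nodes
    # named with -N; 'gpfs0' and the 'all' node class are placeholders.
    subprocess.run(
        ["mmapplypolicy", filesystem, "-P", policy_file, "-N", "all"],
        check=True,
    )

if __name__ == "__main__":
    run_policy_scan("gpfs0")  # hypothetical device name
```

The key point is that the scan phase of such a command must read the attributes of every file in the file system, which is why metadata throughput dominates the running time at this scale.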

The information processing power consumed by leading business, government, and scientific organizations continues to grow at a phenomenal rate (90% CAGR). This growth is one of the factors driving the expansion of data repositories. The amount of online digital data is projected to exceed 1800 EB by the end of 2011 and to continue growing at 40-60% per year [1]. This explosive growth of data, transactions, and digitally aware devices is straining IT infrastructure and routine data management operations. Storage management tasks such as backup, migration to appropriate performance tiers, replication, and distribution are overburdening this infrastructure. With existing solutions, it is not possible today to actively manage 10 billion files.

Unfortunately, the performance of the most commonly used storage device, the disk drive, is not keeping pace with the performance growth required by business and HPC systems. Recent advances in solid-state storage technology deliver significant improvements in performance and performance density, making it well suited to future storage systems that must match the needs of a growing information environment.

This document describes a demonstration in which GPFS took 43 minutes to process the 6.5 TB of metadata for a file system containing 10 billion files. This accomplishment combines enhanced algorithms in GPFS with the use of solid-state storage as the GPFS metadata store. IBM Research has once again pushed GPFS scalability to an unprecedented file system size, enabling much larger data environments to be unified on a single platform and dramatically simplifying data management tasks such as data placement, aging, backup, and replication.
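The headline figures imply remarkable sustained rates. A back-of-the-envelope calculation, using only the numbers quoted above (10 billion files, 6.5 TB of metadata, 43 minutes), gives roughly the following:

```python
# Implied rates derived from the figures quoted in the text; the
# per-file metadata size and throughputs are derived, not measured.
files = 10e9              # 10 billion files
metadata_bytes = 6.5e12   # 6.5 TB of metadata (decimal TB assumed)
seconds = 43 * 60         # 43 minutes

print(f"metadata per file : {metadata_bytes / files:.0f} bytes")
print(f"file scan rate    : {files / seconds / 1e6:.1f} million files/s")
print(f"metadata bandwidth: {metadata_bytes / seconds / 1e9:.1f} GB/s")
# -> roughly 650 bytes/file, 3.9 million files/s, 2.5 GB/s sustained
```

Sustaining millions of metadata operations per second is precisely the access pattern where disk seek latency dominates, which is why placing the metadata on solid-state storage is central to the result.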

By: Richard F. Freitas; Joe Slember; Wayne Sawdon; Lawrence Chiu

Published in: RJ10484 in 2011

LIMITED DISTRIBUTION NOTICE:

This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

