Commit 12a824d

fdmanana authored and kdave committed
btrfs: speedup checking for extent sharedness during fiemap
One of the most expensive tasks performed during fiemap is to check if an extent is shared. This task has two major steps:

1) Check if the data extent is shared. This implies checking the extent item in the extent tree, checking delayed references, etc. If we find the data extent is directly shared, we terminate immediately;

2) If the data extent is not directly shared (its extent item has a refcount of 1), then it may be shared if we have snapshots that share subtrees of the inode's subvolume b+tree. So we check if the leaf containing the file extent item is shared, then its parent node, then the parent node of the parent node, etc, until we reach the root node or we find one of them is shared - in which case we stop immediately.

During fiemap we process the extents of a file from left to right, from file offset 0 to EOF. This means that we iterate b+tree leaves from left to right, which implies that we keep repeating that second step above several times for the same b+tree path of the inode's subvolume b+tree.

For example, if we have two file extent items in leaf X, and the path to leaf X is A -> B -> C -> X, then when we try to determine if the data extent referenced by the first extent item is shared, we check if the data extent is shared - if it's not, then we check if leaf X is shared, if not, then we check if node C is shared, if not, then we check if node B is shared, if not, then we check if node A is shared. When we move to the next file extent item, after determining the data extent is not shared, we repeat the checks for X, C, B and A - doing all the expensive searches in the extent tree, delayed refs, etc. If we have thousands of file extents, then we keep repeating the sharedness checks for the same paths over and over.

On a file that has no shared extents, or only a small portion of shared ones, it's easy to see that this scales terribly with the number of extents in the file and the sizes of the extent and subvolume b+trees.

This change eliminates the repeated sharedness check on extent buffers by caching the results of the last path used. The results can be used as long as no snapshots were created since they were cached (for not shared extent buffers) or no roots were dropped since they were cached (for shared extent buffers). This greatly reduces the time spent by fiemap for files with thousands of extents and/or large extent and subvolume b+trees.

Example performance test:

  $ cat fiemap-perf-test.sh
  #!/bin/bash
  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV
  mount -o compress=lzo $DEV $MNT

  # 40G gives 327680 128K file extents (due to compression).
  xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar

  umount $MNT
  mount -o compress=lzo $DEV $MNT

  start=$(date +%s%N)
  filefrag $MNT/foobar
  end=$(date +%s%N)
  dur=$(( (end - start) / 1000000 ))
  echo "fiemap took $dur milliseconds (metadata not cached)"

  start=$(date +%s%N)
  filefrag $MNT/foobar
  end=$(date +%s%N)
  dur=$(( (end - start) / 1000000 ))
  echo "fiemap took $dur milliseconds (metadata cached)"

  umount $MNT

Before this patch:

  $ ./fiemap-perf-test.sh
  (...)
  /mnt/sdi/foobar: 327680 extents found
  fiemap took 3597 milliseconds (metadata not cached)
  /mnt/sdi/foobar: 327680 extents found
  fiemap took 2107 milliseconds (metadata cached)

After this patch:

  $ ./fiemap-perf-test.sh
  (...)
  /mnt/sdi/foobar: 327680 extents found
  fiemap took 1646 milliseconds (metadata not cached)
  /mnt/sdi/foobar: 327680 extents found
  fiemap took 698 milliseconds (metadata cached)

That's about 2.2x faster when no metadata is cached, and about 3x faster when all metadata is cached. On a real filesystem with many other files, data, directories, etc, the b+trees will be 2 or 3 levels higher, therefore this optimization will have a higher impact.

Reports of slow fiemap operations show up often; the two Link tags below refer to two recent reports of such slowness. This patch, together with the next ones in the series, is meant to address that.

Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
1 parent 8eedadd commit 12a824d
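
To illustrate the caching scheme described in the commit message, below is a minimal, self-contained user-space sketch: an array of {bytenr, generation, is_shared} entries indexed by tree level, where a cached result is trusted only while the relevant generation counter has not moved. The names, types and MAX_LEVEL constant are simplified stand-ins for illustration; they are not the kernel structures added by this commit.

/* Build with: cc -std=c99 -o shared-cache-sketch shared-cache-sketch.c */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_LEVEL 8	/* mirrors the idea of BTRFS_MAX_LEVEL */

struct shared_cache_entry {
	uint64_t bytenr;	/* extent buffer this entry describes */
	uint64_t gen;		/* generation the result was valid for */
	bool is_shared;
};

struct shared_cache {
	struct shared_cache_entry entries[MAX_LEVEL];
};

/* Stand-ins for the filesystem's snapshot/root-drop generation counters. */
static uint64_t last_snapshot_gen;
static uint64_t last_root_drop_gen;

static bool cache_lookup(const struct shared_cache *cache, uint64_t bytenr,
			 int level, bool *is_shared)
{
	const struct shared_cache_entry *entry = &cache->entries[level];

	if (entry->bytenr != bytenr)
		return false;	/* entry unused or for another extent buffer */
	/* A "not shared" result is stale once a new snapshot exists. */
	if (!entry->is_shared && entry->gen != last_snapshot_gen)
		return false;
	/* A "shared" result is stale once some root has been dropped. */
	if (entry->is_shared && entry->gen != last_root_drop_gen)
		return false;

	*is_shared = entry->is_shared;
	return true;
}

static void cache_store(struct shared_cache *cache, uint64_t bytenr,
			int level, bool is_shared)
{
	uint64_t gen = is_shared ? last_root_drop_gen : last_snapshot_gen;
	struct shared_cache_entry *entry = &cache->entries[level];

	entry->bytenr = bytenr;
	entry->gen = gen;
	entry->is_shared = is_shared;

	/* A shared node implies every node below it on the path is shared. */
	if (is_shared) {
		for (int i = 0; i < level; i++) {
			cache->entries[i].is_shared = true;
			cache->entries[i].gen = gen;
		}
	}
}

int main(void)
{
	struct shared_cache cache = { 0 };
	bool shared;

	cache_store(&cache, 4096, 0, false);	/* leaf: not shared */
	printf("hit=%d\n", cache_lookup(&cache, 4096, 0, &shared));	/* hit=1 */
	last_snapshot_gen++;			/* a snapshot was created */
	printf("hit=%d\n", cache_lookup(&cache, 4096, 0, &shared));	/* hit=0, stale */
	return 0;
}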

File tree

5 files changed, +170 −8 lines changed

fs/btrfs/backref.c

+121 −1

@@ -1511,6 +1511,105 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * The caller has joined a transaction or is holding a read lock on the
+ * fs_info->commit_root_sem semaphore, so no need to worry about the root's last
+ * snapshot field changing while updating or checking the cache.
+ */
+static bool lookup_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
+					struct btrfs_root *root,
+					u64 bytenr, int level, bool *is_shared)
+{
+	struct btrfs_backref_shared_cache_entry *entry;
+
+	if (WARN_ON_ONCE(level >= BTRFS_MAX_LEVEL))
+		return false;
+
+	/*
+	 * Level -1 is used for the data extent, which is not reliable to cache
+	 * because its reference count can increase or decrease without us
+	 * realizing. We cache results only for extent buffers that lead from
+	 * the root node down to the leaf with the file extent item.
+	 */
+	ASSERT(level >= 0);
+
+	entry = &cache->entries[level];
+
+	/* Unused cache entry or being used for some other extent buffer. */
+	if (entry->bytenr != bytenr)
+		return false;
+
+	/*
+	 * We cached a false result, but the last snapshot generation of the
+	 * root changed, so we now have a snapshot. Don't trust the result.
+	 */
+	if (!entry->is_shared &&
+	    entry->gen != btrfs_root_last_snapshot(&root->root_item))
+		return false;
+
+	/*
+	 * If we cached a true result and the last generation used for dropping
+	 * a root changed, we can not trust the result, because the dropped root
+	 * could be a snapshot sharing this extent buffer.
+	 */
+	if (entry->is_shared &&
+	    entry->gen != btrfs_get_last_root_drop_gen(root->fs_info))
+		return false;
+
+	*is_shared = entry->is_shared;
+
+	return true;
+}
+
+/*
+ * The caller has joined a transaction or is holding a read lock on the
+ * fs_info->commit_root_sem semaphore, so no need to worry about the root's last
+ * snapshot field changing while updating or checking the cache.
+ */
+static void store_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
+				       struct btrfs_root *root,
+				       u64 bytenr, int level, bool is_shared)
+{
+	struct btrfs_backref_shared_cache_entry *entry;
+	u64 gen;
+
+	if (WARN_ON_ONCE(level >= BTRFS_MAX_LEVEL))
+		return;
+
+	/*
+	 * Level -1 is used for the data extent, which is not reliable to cache
+	 * because its reference count can increase or decrease without us
+	 * realizing. We cache results only for extent buffers that lead from
+	 * the root node down to the leaf with the file extent item.
+	 */
+	ASSERT(level >= 0);
+
+	if (is_shared)
+		gen = btrfs_get_last_root_drop_gen(root->fs_info);
+	else
+		gen = btrfs_root_last_snapshot(&root->root_item);
+
+	entry = &cache->entries[level];
+	entry->bytenr = bytenr;
+	entry->is_shared = is_shared;
+	entry->gen = gen;
+
+	/*
+	 * If we found an extent buffer is shared, set the cache result for all
+	 * extent buffers below it to true. As nodes in the path are COWed,
+	 * their sharedness is moved to their children, and if a leaf is COWed,
+	 * then the sharedness of a data extent becomes direct, i.e. the refcount
+	 * of the data extent is increased in the extent item at the extent tree.
+	 */
+	if (is_shared) {
+		for (int i = 0; i < level; i++) {
+			entry = &cache->entries[i];
+			entry->is_shared = is_shared;
+			entry->gen = gen;
+		}
+	}
+}
+
 /*
  * Check if a data extent is shared or not.
  *
@@ -1519,6 +1618,7 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
  * @bytenr: logical bytenr of the extent we are checking
  * @roots: list of roots this extent is shared among
  * @tmp: temporary list used for iteration
+ * @cache: a backref lookup result cache
  *
  * btrfs_is_data_extent_shared uses the backref walking code but will short
  * circuit as soon as it finds a root or inode that doesn't match the
@@ -1532,7 +1632,8 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
  * Return: 0 if extent is not shared, 1 if it is shared, < 0 on error.
  */
 int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
-				struct ulist *roots, struct ulist *tmp)
+				struct ulist *roots, struct ulist *tmp,
+				struct btrfs_backref_shared_cache *cache)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_trans_handle *trans;
@@ -1545,6 +1646,7 @@ int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
 		.inum = inum,
 		.share_count = 0,
 	};
+	int level;
 
 	ulist_init(roots);
 	ulist_init(tmp);
@@ -1561,22 +1663,40 @@ int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
 		btrfs_get_tree_mod_seq(fs_info, &elem);
 	}
 
+	/* -1 means we are in the bytenr of the data extent. */
+	level = -1;
 	ULIST_ITER_INIT(&uiter);
 	while (1) {
+		bool is_shared;
+		bool cached;
+
 		ret = find_parent_nodes(trans, fs_info, bytenr, elem.seq, tmp,
 					roots, NULL, &shared, false);
 		if (ret == BACKREF_FOUND_SHARED) {
 			/* this is the only condition under which we return 1 */
 			ret = 1;
+			if (level >= 0)
+				store_backref_shared_cache(cache, root, bytenr,
+							   level, true);
 			break;
 		}
 		if (ret < 0 && ret != -ENOENT)
 			break;
 		ret = 0;
+		if (level >= 0)
+			store_backref_shared_cache(cache, root, bytenr,
						   level, false);
 		node = ulist_next(tmp, &uiter);
 		if (!node)
			break;
		bytenr = node->val;
+		level++;
+		cached = lookup_backref_shared_cache(cache, root, bytenr, level,
+						     &is_shared);
+		if (cached) {
+			ret = (is_shared ? 1 : 0);
+			break;
+		}
 		shared.share_count = 0;
 		cond_resched();
 	}
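
The intended usage pattern (shown concretely in the fs/btrfs/extent_io.c hunk further down) is to allocate one cache per fiemap call and reuse it across every extent of the file, so the walk up the b+tree path is only paid again when the path changes or the cached generations go stale. Schematically, where for_each_file_extent() and extent_disk_bytenr are illustrative stand-ins rather than real kernel helpers:

	struct btrfs_backref_shared_cache *cache;

	cache = kzalloc(sizeof(*cache), GFP_KERNEL);
	if (!cache)
		return -ENOMEM;

	for_each_file_extent(inode, extent) {	/* stand-in for fiemap's extent walk */
		ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
						  extent_disk_bytenr, roots,
						  tmp_ulist, cache);
		if (ret < 0)
			break;
		/* ret == 1 means shared: report FIEMAP_EXTENT_SHARED. */
	}

	kfree(cache);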

fs/btrfs/backref.h

+16 −1

@@ -17,6 +17,20 @@ struct inode_fs_paths {
 	struct btrfs_data_container *fspath;
 };
 
+struct btrfs_backref_shared_cache_entry {
+	u64 bytenr;
+	u64 gen;
+	bool is_shared;
+};
+
+struct btrfs_backref_shared_cache {
+	/*
+	 * A path from a root to a leaf that has a file extent item pointing to
+	 * a given data extent should never exceed the maximum b+tree height.
+	 */
+	struct btrfs_backref_shared_cache_entry entries[BTRFS_MAX_LEVEL];
+};
+
 typedef int (iterate_extent_inodes_t)(u64 inum, u64 offset, u64 root,
 				      void *ctx);
 
@@ -63,7 +77,8 @@ int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid,
 			  struct btrfs_inode_extref **ret_extref,
 			  u64 *found_off);
 int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
-				struct ulist *roots, struct ulist *tmp);
+				struct ulist *roots, struct ulist *tmp,
+				struct btrfs_backref_shared_cache *cache);
 
 int __init btrfs_prelim_ref_init(void);
 void __cold btrfs_prelim_ref_exit(void);

fs/btrfs/ctree.h

+18

@@ -1089,6 +1089,13 @@ struct btrfs_fs_info {
 	/* Updates are not protected by any lock */
 	struct btrfs_commit_stats commit_stats;
 
+	/*
+	 * Last generation where we dropped a non-relocation root.
+	 * Use btrfs_set_last_root_drop_gen() and btrfs_get_last_root_drop_gen()
+	 * to change it and to read it, respectively.
+	 */
+	u64 last_root_drop_gen;
+
 	/*
 	 * Annotations for transaction events (structures are empty when
 	 * compiled without lockdep).
@@ -1113,6 +1120,17 @@ struct btrfs_fs_info {
 #endif
 };
 
+static inline void btrfs_set_last_root_drop_gen(struct btrfs_fs_info *fs_info,
+						u64 gen)
+{
+	WRITE_ONCE(fs_info->last_root_drop_gen, gen);
+}
+
+static inline u64 btrfs_get_last_root_drop_gen(const struct btrfs_fs_info *fs_info)
+{
+	return READ_ONCE(fs_info->last_root_drop_gen);
+}
+
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
 {
 	return sb->s_fs_info;

fs/btrfs/extent-tree.c

+9 −1

@@ -5635,6 +5635,8 @@ static noinline int walk_up_tree(struct btrfs_trans_handle *trans,
  */
 int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 {
+	const bool is_reloc_root = (root->root_key.objectid ==
+				    BTRFS_TREE_RELOC_OBJECTID);
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_path *path;
 	struct btrfs_trans_handle *trans;
@@ -5794,6 +5796,9 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 			goto out_end_trans;
 		}
 
+		if (!is_reloc_root)
+			btrfs_set_last_root_drop_gen(fs_info, trans->transid);
+
 		btrfs_end_transaction_throttle(trans);
 		if (!for_reloc && btrfs_need_cleaner_sleep(fs_info)) {
 			btrfs_debug(fs_info,
@@ -5828,7 +5833,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 		goto out_end_trans;
 	}
 
-	if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
+	if (!is_reloc_root) {
 		ret = btrfs_find_root(tree_root, &root->root_key, path,
 				      NULL, NULL);
 		if (ret < 0) {
@@ -5860,6 +5865,9 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 	btrfs_put_root(root);
 	root_dropped = true;
 out_end_trans:
+	if (!is_reloc_root)
+		btrfs_set_last_root_drop_gen(fs_info, trans->transid);
+
 	btrfs_end_transaction_throttle(trans);
 out_free:
 	kfree(wc);

fs/btrfs/extent_io.c

+6 −5

@@ -5448,20 +5448,19 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	struct btrfs_path *path;
 	struct btrfs_root *root = inode->root;
 	struct fiemap_cache cache = { 0 };
+	struct btrfs_backref_shared_cache *backref_cache;
 	struct ulist *roots;
 	struct ulist *tmp_ulist;
 	int end = 0;
 	u64 em_start = 0;
 	u64 em_len = 0;
 	u64 em_end = 0;
 
+	backref_cache = kzalloc(sizeof(*backref_cache), GFP_KERNEL);
 	path = btrfs_alloc_path();
-	if (!path)
-		return -ENOMEM;
-
 	roots = ulist_alloc(GFP_KERNEL);
 	tmp_ulist = ulist_alloc(GFP_KERNEL);
-	if (!roots || !tmp_ulist) {
+	if (!backref_cache || !path || !roots || !tmp_ulist) {
 		ret = -ENOMEM;
 		goto out_free_ulist;
 	}
@@ -5587,7 +5586,8 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		 */
 		ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
						  bytenr, roots,
-						  tmp_ulist);
+						  tmp_ulist,
+						  backref_cache);
 		if (ret < 0)
 			goto out_free;
 		if (ret)
@@ -5639,6 +5639,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 			       &cached_state);
 
 out_free_ulist:
+	kfree(backref_cache);
 	btrfs_free_path(path);
 	ulist_free(roots);
 	ulist_free(tmp_ulist);
