Skip to content

Client hung after end of results during heavy load #285

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
encryptio opened this issue Oct 9, 2014 · 19 comments · Fixed by #321
Closed

Client hung after end of results during heavy load #285

encryptio opened this issue Oct 9, 2014 · 19 comments · Fixed by #321
Labels
Milestone

Comments

@encryptio
Copy link

I'm running many concurrent transactions (half read, half write, often with statements which cause large write locks on rows/index pages) with tx_isolation=serializable, and rerunning transactions where any statement returns a *MySQLError with Number == ER_LOCK_DEADLOCK (1213). There are very often deadlock errors (~0.015 retries per completed transaction), which is expected under the kind of queries and load I'm putting on the system.

This code runs fine for a while, but occasionally (more often in heavy load), a client connection will get locked up waiting for an extra packet at the end of a result list that the server doesn't think needs to be sent. Unfortunately, this happens a hell of a lot under heavy load (once every couple of minutes at 100 transactions per second sustained.)

The client is blocked (forever) on:

1 @ 0x4237d6 0x42e4c6 0x42de2a 0x52c314 0x52c370 0x52d6f0 0x53cac5 0x4bceac 0x4bcfef 0x4c0ad1 0x4c4347 0x4c879f 0x49f3b8 0x49ec6f 0x490a1d 0x48ca2f 0x406ce3 0x406b56 0x48eaad 0x48e865 0x4066d7 0x4239a0
#   0x42e4c6    netpollblock+0xa6                           /usr/lib/golang/src/pkg/runtime/netpoll.goc:280
#   0x42de2a    net.runtime_pollWait+0x6a                       /usr/lib/golang/src/pkg/runtime/netpoll.goc:116
#   0x52c314    net.(*pollDesc).Wait+0x34                       /usr/lib/golang/src/pkg/net/fd_poll_runtime.go:81
#   0x52c370    net.(*pollDesc).WaitRead+0x30                       /usr/lib/golang/src/pkg/net/fd_poll_runtime.go:86
#   0x52d6f0    net.(*netFD).Read+0x2a0                         /usr/lib/golang/src/pkg/net/fd_unix.go:204
#   0x53cac5    net.(*conn).Read+0xc5                           /usr/lib/golang/src/pkg/net/net.go:122
#   0x4bceac    github.com/go-sql-driver/mysql.(*buffer).fill+0x1dc         /home/croll/local/go/src/github.com/go-sql-driver/mysql/buffer.go:55
#   0x4bcfef    github.com/go-sql-driver/mysql.(*buffer).readNext+0x3f          /home/croll/local/go/src/github.com/go-sql-driver/mysql/buffer.go:76
#   0x4c0ad1    github.com/go-sql-driver/mysql.(*mysqlConn).readPacket+0x61     /home/croll/local/go/src/github.com/go-sql-driver/mysql/packets.go:30
#   0x4c4347    github.com/go-sql-driver/mysql.(*mysqlConn).readUntilEOF+0x27       /home/croll/local/go/src/github.com/go-sql-driver/mysql/packets.go:641
#   0x4c879f    github.com/go-sql-driver/mysql.(*mysqlRows).Close+0x6f          /home/croll/local/go/src/github.com/go-sql-driver/mysql/rows.go:53
#   0x49f3b8    database/sql.(*Rows).Close+0x78                     /usr/lib/golang/src/pkg/database/sql/sql.go:1580
#   0x49ec6f    database/sql.(*Rows).Next+0xdf                      /usr/lib/golang/src/pkg/database/sql/sql.go:1511
#   0x490a1d    go.codemobs.com/gom/db/mysql.ScanRowsToObjects+0x4d         /home/croll/local/go/src/go.codemobs.com/gom/db/mysql/map.go:104
#   0x48ca2f    go.codemobs.com/gom/db/mysql.(*Ctx).Get+0x2ff               /home/croll/local/go/src/go.codemobs.com/gom/db/mysql/ctx.go:35
#   0x406ce3    main.scheduleJob+0xb3                           /home/croll/local/go/src/go.codemobs.com/vps/transcode-master/scheduler.go:100
#   0x406b56    main.scheduleJobs+0x3c6                         /home/croll/local/go/src/go.codemobs.com/vps/transcode-master/scheduler.go:85
#   0x48eaad    go.codemobs.com/gom/db/mysql.(*MySQLBackend).runTxWithRetryCount+0x14d  /home/croll/local/go/src/go.codemobs.com/gom/db/mysql/db.go:81
#   0x48e865    go.codemobs.com/gom/db/mysql.(*MySQLBackend).RunTx+0x65         /home/croll/local/go/src/go.codemobs.com/gom/db/mysql/db.go:54
#   0x4066d7    main.schedulerLoop+0x57                         /home/croll/local/go/src/go.codemobs.com/vps/transcode-master/scheduler.go:26

But the server thinks the connection is idle:

mysql> show processlist;
+--------+------+-----------------+--------+---------+------+-------+------------------+
| Id     | User | Host            | db     | Command | Time | State | Info             |
+--------+------+-----------------+--------+---------+------+-------+------------------+
| 162017 | root | localhost:45301 | master | Sleep   |   14 |       | NULL             |
| 162024 | root | localhost       | NULL   | Query   |    0 | NULL  | show processlist |
+--------+------+-----------------+--------+---------+------+-------+------------------+

It's not specific to the scheduler transactions either; it appears to happen randomly, roughly in proportion to what I expect the number of rows locked in each transaction is.

I suspect either the mysqld implementation or the go mysql driver have a bug somewhere in their protocol handling, but I don't know where to go from here.

Using go-sql-driver/mysql as of 9543750 Merge pull request #280 from hyandell/master. I can make this happen reliably with MySQL 5.5.37 and 10.0.14-MariaDB, and with Go 1.2.2 and Go 1.3.3, and have not found a version of anything on which it does not lock up.

@arnehormann
Copy link
Member

Hi @encryptio, can you give us a program reliably reproducing this?
It doesn't need to be the one you use.

@encryptio
Copy link
Author

I've been trying to make a minimal reproducer, and in doing so tried to find a series of calls that caused it. That's not done yet, but I did find something interesting after creating a fake mysql driver that logs all the calls made to it and passes them on to go-sql-driver/mysql:

All lockups occur when driver.Rows.Next() has returned several rows, is called again, and that returns a deadlock error ((*MySQLError).Number == 1213), then has driver.Rows.Close() called immediately after, which hangs. Perhaps that can point you in the right direction as I work on this further.

@encryptio
Copy link
Author

I was able to fix it with this patch:

encryptio@91054db

It changes the logic of binaryRows.Next to consider both io.EOF and any MySQLError to cause rows.mc = nil, which avoids the mc.readUntilEOF() call in mysqlRows.Close.

Similar logic probably needs to be applied to the textRows object, but I'm less sure about that because I can't test it.

@encryptio
Copy link
Author

I've spent several hours trying to build a small reproducer, but haven't had success yet. I have other priorities for the moment, and hope to come back to this starting next week.

@arnehormann
Copy link
Member

@encryptio thank you! Your analysis definitely helps, maybe we can even get rid of more than one issue at once. I also don't have the time right now to really dig into it.
I'm on the fence if we can have a test or if we should just fix it. Also, if such a test involves calling runtime.GOMAXPROCS(x), it should not be part of the standard test suite. Having one would still be great.

@encryptio
Copy link
Author

I think my patch is leaking connections when the deadlock error occurs. Not sure why, but at least the user code continues.

@arnehormann
Copy link
Member

Probably because the net connection is not closed. You only set rows.mc to nil, but rows.mc.netConn is still open.

@encryptio
Copy link
Author

I've been looking into this further; closing the netConn explicitly should fix the connections leaking issue, but that doesn't seem right either: If this happens in a transaction (which is the case in my code), the transaction shouldn't be forfeited; it's a non-fatal error, and you can continue to send (different) queries to the transaction (IMO, that's a misdesign of the MySQL protocol, but alas.) Why is database/sql not continuing to use the mysqlConn with my patch?

When the deadlock error is returned from tx.Query/tx.Exec directly (that is, not after row iteration starts) it does allow tx.Rollback to reuse the connection later.

Unfortunately, I can't reproduce the leak anymore with my patch applied (with no version changes, afaik), yet the current master branch of the mysql driver still locks up.

@zhujianfeng
Copy link

Same problem. but my code just read mysql, no write. And blocked on :

goroutine 21831 [IO wait, 4 minutes]:
net.runtime_pollWait(0x7f51db9db210, 0x72, 0x0)
    /usr/local/go/src/pkg/runtime/netpoll.goc:146 +0x66
net.(*pollDesc).Wait(0xc28ef84b50, 0x72, 0x0, 0x0)
    /usr/local/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(*pollDesc).WaitRead(0xc28ef84b50, 0x0, 0x0)
    /usr/local/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(*netFD).Read(0xc28ef84af0, 0xc28801f000, 0x1000, 0x1000, 0x0, 0x7f51db9bc418, 0xb)
    /usr/local/go/src/pkg/net/fd_unix.go:232 +0x34c
net.(*conn).Read(0xc254d84c28, 0xc28801f000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/pkg/net/net.go:122 +0xe7
github.com/go-sql-driver/mysql.(*buffer).fill(0xc2a1f44300, 0x4, 0x0, 0x0)
    /home/uaq/opbin/zhujianfeng/webstat/src/github.com/go-sql-driver/mysql/buffer.go:57 +0x271
github.com/go-sql-driver/mysql.(*buffer).readNext(0xc2a1f44300, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/uaq/opbin/zhujianfeng/webstat/src/github.com/go-sql-driver/mysql/buffer.go:86 +0x6c
github.com/go-sql-driver/mysql.(*mysqlConn).readPacket(0xc2a1f44300, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/uaq/opbin/zhujianfeng/webstat/src/github.com/go-sql-driver/mysql/packets.go:30 +0x9d
github.com/go-sql-driver/mysql.(*mysqlConn).readInitPacket(0xc2a1f44300, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/uaq/opbin/zhujianfeng/webstat/src/github.com/go-sql-driver/mysql/packets.go:131 +0x71
github.com/go-sql-driver/mysql.MySQLDriver.Open(0xc208040480, 0x39, 0x0, 0x0, 0x0, 0x0)
    /home/uaq/opbin/zhujianfeng/webstat/src/github.com/go-sql-driver/mysql/driver.go:83 +0x3ec
github.com/go-sql-driver/mysql.(*MySQLDriver).Open(0xd39280, 0xc208040480, 0x39, 0x0, 0x0, 0x0, 0x0)
    <autogenerated>:9 +0xc5
database/sql.(*DB).conn(0xc208156000, 0x0, 0x0, 0x0)
    /usr/local/go/src/pkg/database/sql/sql.go:666 +0x4f5
database/sql.(*DB).query(0xc208156000, 0xc27e285680, 0xe1, 0x0, 0x0, 0x0, 0x40eb54, 0x0, 0x0)
    /usr/local/go/src/pkg/database/sql/sql.go:924 +0x44
database/sql.(*DB).Query(0xc208156000, 0xc27e285680, 0xe1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/pkg/database/sql/sql.go:915 +0xad
github.com/astaxie/beego/orm.(*rawSet).readValues(0xc2b1d5b020, 0x823940, 0xc250ea4640, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/uaq/opbin/zhujianfeng/webstat/src/github.com/astaxie/beego/orm/orm_raw.go:560 +0x286
github.com/astaxie/beego/orm.(*rawSet).Values(0xc2b1d5b020, 0xc250ea4640, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/uaq/opbin/zhujianfeng/webstat/src/github.com/astaxie/beego/orm/orm_raw.go:779 +0x83
webstat/util.execSql(0xc260ecfa40, 0xb, 0xa6a750, 0x15, 0xc290b1944c, 0x2, 0xc2ab79a8c0, 0x46, 0x9e80d0, 0x0, ...)
    /home/uaq/opbin/zhujianfeng/webstat/src/webstat/util/sql.go:114 +0x349
webstat/util.RunSql(0xc260ecfa40, 0xb, 0xa6a750, 0x15, 0xc290b1944c, 0x2, 0xc2ab79a8c0, 0x46, 0x9e80d0, 0x0, ...)
    /home/uaq/opbin/zhujianfeng/webstat/src/webstat/util/sql.go:34 +0x5d4
webstat/util.FetchData(0xc290b1944c, 0x5, 0xc260ecfa40, 0xb, 0xc290b1947e, 0x1, 0xc290b19480, 0x6, 0x4055400000000000, 0x4014000000000000, ...)
    /home/uaq/opbin/zhujianfeng/webstat/src/webstat/util/modfuncs.go:179 +0x786
webstat/models/ispmodel.Worker(0xc290b1944c, 0x5, 0xc290b19473, 0xa, 0xc290b1947e, 0x1, 0xc290b19480, 0x6, 0x4055400000000000, 0x4014000000000000, ...)
    /home/uaq/opbin/zhujianfeng/webstat/src/webstat/models/ispmodel/isp.go:35 +0xf1
created by webstat/controllers.(*IspController).statData
    /home/uaq/opbin/zhujianfeng/webstat/src/webstat/controllers/isp.go:58 +0x27c

which is

    // Reading Handshake Initialization Packet
    cipher, err := mc.readInitPacket()

@julienschmidt julienschmidt added this to the v1.3 milestone Feb 13, 2015
@julienschmidt
Copy link
Member

Can one of you provide us a small sample program with which this deadlock occurs?
I want this to be fixed for v1.3

@methane
Copy link
Member

methane commented Feb 13, 2015

Are there any messages in MySQL's error log?

@ikkeps
Copy link

ikkeps commented Feb 24, 2015

Hi! I have similar problem in production, and after couple of days of debugging, realized that everything is just like @encryptio says. But, in our case error codes are 1885 and 1028 (query execution timeout related). Here is minimal code to reproduce bug - https://gist.github.com/ikkeps/058e090b561add81a910 (it uses set max_statement_time=... to force myqsl to throw error. Yes, simple db.Exec is not really reliable here, since client can reconnect at any time, but I did not found better way to do it.) Also i did not see any messages in mysql log except many of

Sort aborted: Query execution was interrupted, max_statement_time exceeded

@arnehormann
Copy link
Member

@ikkeps if you use an older MySQL version on your server, that query can easily exceed a 1 second timeout. It takes 12 seconds on one of our machines.
Which makes max_statement_time exceeded very plausible - this does not reproduce the bug.
You could also just call SELECT SLEEP(2) instead.

@ikkeps
Copy link

ikkeps commented Feb 24, 2015

@arnehormann that was the point (also max_statement_time is in milliseconds). Driver hangs when server returns some error (I am not sure about other errors, but with timeout-related errors it almost always does), and it hangs when trying to read all remaining rows on Close() (with the same stack as in @encryptio description), but, as far as I can see mysql does not returns any additional rows, and driver waits almost forever (until mysql drops connection on wait_timeout). We use MySQL 5.6.21-70.1-log (Percona Server 70.1) if this helps.

@arnehormann
Copy link
Member

@ikkeps ok. Apparently I don't have an instance supporting max_statement_time right now. Have to test later...

@methane
Copy link
Member

methane commented Feb 24, 2015

This may fix the hung https://gist.github.com/methane/6c6e2db8464c0579a7af

@julienschmidt
Copy link
Member

Please notify us in case this problem still persists.

@ikkeps
Copy link

ikkeps commented Feb 25, 2015

At first look this does not fix the bug (at least for max_statement_time case, which, still, could be unrelated to original bug described). I started wireshark and, as far as I can see, driver hungs when error happens after some rows successfully readed, and driver tries to Close() rows (and tries to read rows until EOF). Still almost the same stack as in original error.

net.runtime_pollWait(0x7f77ba378530, 0x72, 0x0)
    /usr/local/go/src/pkg/runtime/netpoll.goc:146 +0x66
net.(*pollDesc).Wait(0xc2080261b0, 0x72, 0x0, 0x0)
    /home/developer/opt/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(*pollDesc).WaitRead(0xc2080261b0, 0x0, 0x0)
    /home/developer/opt/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(*netFD).Read(0xc208026150, 0xc208042000, 0x1000, 0x1000, 0x0, 0x7f77ba3773f0, 0xb)
    /home/developer/opt/go/src/pkg/net/fd_unix.go:242 +0x34c
net.(*conn).Read(0xc20803a020, 0xc208042000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /home/developer/opt/go/src/pkg/net/net.go:122 +0xe7
github.com/go-sql-driver/mysql.(*buffer).fill(0xc208046200, 0x4, 0x0, 0x0)
    /home/developer/gocode/src/github.com/go-sql-driver/mysql/buffer.go:57 +0x271
github.com/go-sql-driver/mysql.(*buffer).readNext(0xc208046200, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/developer/gocode/src/github.com/go-sql-driver/mysql/buffer.go:86 +0x6c
github.com/go-sql-driver/mysql.(*mysqlConn).readPacket(0xc208046200, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/developer/gocode/src/github.com/go-sql-driver/mysql/packets.go:30 +0x9d
github.com/go-sql-driver/mysql.(*mysqlConn).readUntilEOF(0xc208046200, 0x0, 0x0)
    /home/developer/gocode/src/github.com/go-sql-driver/mysql/packets.go:657 +0x39
github.com/go-sql-driver/mysql.(*mysqlRows).Close(0xc208040360, 0x0, 0x0)
    /home/developer/gocode/src/github.com/go-sql-driver/mysql/rows.go:63 +0x81
database/sql.(*Rows).Close(0xc208004360, 0x0, 0x0)
    /usr/local/go/src/pkg/database/sql/sql.go:1613 +0x9e
database/sql.(*Row).Scan(0xc208040380, 0x7f77ba1e8e68, 0x1, 0x1, 0x0, 0x0)
    /usr/local/go/src/pkg/database/sql/sql.go:1671 +0x3ed
main.main()

@methane
Copy link
Member

methane commented Feb 25, 2015

I (and Travic-CI) doesn't have MySQL 5.7 installed.
So I can't reproduce the hung up.
I want another way to reproduce it on MySQL 5.6...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants