Wednesday, November 20, 2013

Fast Random File Read With xp_read_file()

The description of the new READ_SERVER_FILE() procedure introduced in SQL Anywhere 16.0.0.1675 leads one to ask the question, "What makes READ_SERVER_FILE() different from the good old xp_read_file()?

One obvious difference is that READ_SERVER_FILE() lets you specify a substring of the file while xp_read_file() doesn't.

Another difference is that xp_read_file() has a mysterious "lazy" parameter that was introduced in Version 12 of SQL Anywhere:

Enhancements to the xp_read_file system procedure
The xp_read_file system procedure now includes an optional parameter that allows you to specify lazy reads. When you specify this optional parameter and its value is not zero, the file is read and then immediately unlocked.

Did you understand that?

No, neither did I, which might be why this new feature didn't get mentioned in the Top 10 Features in SQL Anywhere 12.

Experimentation shows that the xp_read_file "lazy" parameter might as well be called "fast" if you're interested in reading small substrings from large files; here's how you can use it:
  • Execute a SET @variable = xp_read_file ( 'filespec', 1 ) statement to open the file but NOT actually read all the data into @variable.

  • Read substrings of data from the file via SUBSTR ( @variable, @start, @length ) calls.
Here are the effects:
  • Both operations, the xp_read_file() call and the SUBSTR(), are much faster with lazy = 1 than with the default lazy = 0 . . . not just one of them is faster, but both of them are.
    Lazy xp_read_file() call          .28900 seconds.
    Lazy SELECT SUBSTR()              .32400 seconds.
    Non-lazy xp_read_file() call    74.70200 seconds.
    Non-lazy SELECT SUBSTR()       211.44400 seconds.
    

  • Setting lazy = 1 effectively implements the substring start and length parameters introduced with the new READ_FILE_SERVER() procedure.

  • For large files and small substrings, setting lazy = 1 virtually eliminates extreme RAM cache growth. In some cases this makes the difference between success and an unresponsive server.
Here's the code for the two tests:
-- Lazy test of 352485376 bytes

BEGIN
DECLARE @large_file_1  LONG VARCHAR;
DECLARE @substring_1   VARCHAR ( 100 );
DECLARE @started_at    TIMESTAMP;

MESSAGE STRING ( 'Lazy test starting.' ) TO CONSOLE;

SET @started_at = CURRENT TIMESTAMP;
SET @large_file_1 = xp_read_file ( 'D:/Foxhound1 - 3738 after long benchmark tests/foxhound1.db', 1 );
MESSAGE STRING ( 
   'Lazy xp_read_file() call took ',
   DATEDIFF ( MILLISECOND, @started_at, CURRENT TIMESTAMP ) / 1000.0,
   ' seconds.' ) TO CONSOLE;

SET @started_at = CURRENT TIMESTAMP;
                                
SELECT SUBSTR ( @large_file_1, 200000000, 100 );
MESSAGE STRING ( 
   'Lazy SELECT SUBSTR() took ',
   DATEDIFF ( MILLISECOND, @started_at, CURRENT TIMESTAMP ) / 1000.0,
   ' seconds.' ) TO CONSOLE;

END;

Now accepting requests
Cache size adjusted to 6816K
Cache size adjusted to 8532K
Lazy test starting.
Lazy xp_read_file() call took .28900 seconds.
Lazy SELECT SUBSTR() took .32400 seconds.

SUBSTR(@large_file_1,200000000,100)
'N@SiyŒŸ²ÅØë¨G$7J]pƒ–©¼Ïâõ\x08\x1b/ATgz\x8d ³ÆÙìÿ\x12%8K^q ... '
-- Non-lazy test of 352485376 bytes

BEGIN
DECLARE @large_file_2  LONG VARCHAR;
DECLARE @substring_2   VARCHAR ( 100 );
DECLARE @started_at    TIMESTAMP;

MESSAGE STRING ( 'Non-lazy test starting.' ) TO CONSOLE;

SET @started_at = CURRENT TIMESTAMP;
SET @large_file_2 = xp_read_file ( 'D:/Foxhound1 - 3738 after long benchmark tests/foxhound1.db', 0 );
MESSAGE STRING (
   'Non-lazy xp_read_file() call took ',
   DATEDIFF ( MILLISECOND, @started_at, CURRENT TIMESTAMP ) / 1000.0,
   ' seconds.' ) TO CONSOLE;

SET @started_at = CURRENT TIMESTAMP;
SELECT SUBSTR ( @large_file_2, 200000000, 100 );
MESSAGE STRING ( 
   'Non-lazy SELECT SUBSTR() took ',
   DATEDIFF ( MILLISECOND, @started_at, CURRENT TIMESTAMP ) / 1000.0,
   ' seconds.' ) TO CONSOLE;

END;

Now accepting requests
Cache size adjusted to 8008K
Cache size adjusted to 8808K
Non-lazy test starting.
Cache size adjusted to 9192K
Cache size adjusted to 53956K
Cache size adjusted to 257692K
Cache size adjusted to 409424K
Cache size adjusted to 825356K
Non-lazy xp_read_file() call took 74.70200 seconds.
Non-lazy SELECT SUBSTR() took 211.44400 seconds.
Cache size adjusted to 1282136K

SUBSTR(@large_file_2,200000000,100)
'N@SiyŒŸ²ÅØë¨G$7J]pƒ–©¼Ïâõ\x08\x1b/ATgz\x8d ³ÆÙìÿ\x12%8K^q ... '


2 comments:

John Smirnios said...

Yes, substr( xp_read_file( xxx, 1 ) ) is much the same as read_server_file.

The primary reason for adding read_server_file is to handle files that are >2GB in size. xp_read_file naively returns an error if the file is >2GB in size rather than just reporting an error if the substr being retrieved is >2GB (max size of a long binary). That problem could be addressed in xp_read_file; however, the arguments to substr and most of the internal string handling is limited to 2GB lengths. Therefore, we added substring semantics into read_server_file directly to allow 64-bit file offsets. read_server_file also has a 64-bit length parameter but that is just future-proofing: lengths >2GB are rejected.

-john.

Anonymous said...

John, that would be a valuable explanation to DCX, once read_server_file() is documented there (AFAIK, it is still missing).

Regards
Volker