Poor performance for small files

Last post 12-05-2008 5:29 PM by BarryR. 3 replies.
  • 12-04-2008 5:20 PM

    Poor performance for small files

    Hi, 

    I just wanted to share my experiences of using Nirvanix to upload/download lots of small files.

    We're developing a data backup application. Because of the delta compression system, it is common for us to need to upload lots of small files (~10kb).

    I have found that uploading small files this way performs quite poorly with Nirvanix, even though we're using an efficient transport mechanism (i.e. multipart form posting).

    Now, some overhead is to be expected: there is an inherent cost with lots of small files versus, say, a single large file of the same total size. However, my gut feeling is that even with this overhead, Nirvanix seems unexpectedly slow.

    Firstly, I'd like to separate out my experiences into three separate issues: HTTP upload, HTTP download and the web service calling.

    Web service calling

    The problem here is simply that you need to make several web service calls for each file uploaded. This is just an inherent problem with web services over HTTP. The average round trip for us is around 50ms, so if we upload lots of files sequentially and we need to make at least one WS call for each, we find our average upload speed just plummets - we're spending a lot of time waiting for the HTTP response.
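
    To put a rough number on this, here is a minimal sketch of how per-file round trips cap sequential throughput. The 10 KB file size and 50 ms round trip are the figures from this post; the 100 KB/s uplink and the two calls per file are assumptions picked purely for illustration.

```python
def sequential_throughput_kb_per_s(file_kb, bandwidth_kb_per_s, round_trip_s, calls_per_file):
    """Effective throughput when every file waits on its own round trips."""
    transfer_s = file_kb / bandwidth_kb_per_s   # time the data spends on the wire
    waiting_s = round_trip_s * calls_per_file   # time spent idle, waiting for responses
    return file_kb / (transfer_s + waiting_s)


# 10 KB files and a 50 ms round trip come from the post.
# The 100 KB/s uplink and 2 calls per file are assumed values.
rate = sequential_throughput_kb_per_s(10, 100, 0.05, 2)
print(f"{rate:.1f} KB/s effective on a 100 KB/s link")
```

    With these assumed numbers, half of the time per file is spent waiting rather than sending, and the share gets worse as the files get smaller.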

    What would be great would be to have "aggregate" web services. So for example, the ability to call SetMetadata with multiple files/metadata in one go.

    HTTP Uploading

    We're using multipart form posting. We post several files in one go (up to 50) to try and eliminate the overhead of many HTTP calls. This means there is a single HTTP post which streams all the data for many files all in one go.
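
    For anyone unfamiliar with the technique, the sketch below shows roughly how several small files end up in one multipart/form-data body. It is not the Nirvanix upload API: the field names (`file0`, `file1`, ...) and the `.delta` file names are made up for illustration, and a real request would also need whatever authentication fields the service expects.

```python
import uuid


def build_multipart(files):
    """Encode {filename: bytes} as a single multipart/form-data body.

    Returns (content_type_header_value, body_bytes).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for index, (name, data) in enumerate(files.items()):
        # Field names "file0", "file1", ... are hypothetical placeholders.
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file{index}"; filename="{name}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        )
        parts.append(header.encode("utf-8") + data + b"\r\n")
    # The closing boundary marks the end of the whole body.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)


content_type, body = build_multipart({"a.delta": b"x" * 10, "b.delta": b"y" * 10})
```

    The whole body then goes out in one POST with `content_type` as the Content-Type header, so the client pays for one HTTP request instead of one per file.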

    When I watch the progress in Wireshark (a packet sniffer), all seems to go well: the initial POST occurs and then I see all the files' data pushed out in one go. During this part, the upload bandwidth of my broadband connection is maxed out - i.e. uploading nice and fast.

    However, the problem is that there is a huge pause (about 20 secs for 50 files) between the last data packet being sent and the HTTP OK being returned. I'm certain that this delay is at the Nirvanix server end because I can see in Wireshark that all the data has been sent out on the wire.

    I then compared this to uploading a single file of the same total size, and predictably the pause wasn't there. Now of course I totally appreciate that a certain amount of overhead is to be expected as you increase the number of files, but the delays I'm seeing seem too high (around 500ms per file - not trivial if you're uploading 1000s of files).

    So the bottom line is that uploading lots of small files means the *overall* upload speed is pitiful. I get a burst of really good speed while the data is posted, but then a huge pause.

    This feels like it shouldn't be the case: the total amount of data I was uploading wasn't very large (less than a megabyte), and the number of files wasn't very large either (50 files). It seems like  there is some per-file overhead at the server end, not related to the actual data size.
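
    As a back-of-the-envelope check, the sketch below turns the roughly 20 second pause for 50 files mentioned earlier in this post into a per-file figure and an overall rate. The 500 KB total (50 files of about 10 KB) and the 100 KB/s uplink are assumptions for illustration, not measured values.

```python
def per_file_overhead_s(pause_s, file_count):
    """Server-side pause attributed evenly to each file in the batch."""
    return pause_s / file_count


def effective_rate_kb_per_s(total_kb, wire_rate_kb_per_s, pause_s):
    """Overall rate once the pause after the last packet is included."""
    return total_kb / (total_kb / wire_rate_kb_per_s + pause_s)


# 20 s and 50 files are from the post; 500 KB and 100 KB/s are assumed.
overhead = per_file_overhead_s(20, 50)
rate = effective_rate_kb_per_s(500, 100, 20)
print(f"{overhead:.1f} s per file, {rate:.1f} KB/s overall")
```

    That works out to about 0.4 s per file, in the same ballpark as the "around 500ms" figure above, and with these assumed numbers the pause accounts for most of the total upload time.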

    Some good news is that this overhead seems to disappear if your file sizes are slightly larger. Also, I guess for most people this just won't be an issue (depending on the size of the files they want to upload).

    I've got a spreadsheet of some experiments if anyone at Nirvanix wants to take a look.

    HTTP downloading
    This is also not so performant: we have to make a single web service call for each file, and then a single HTTP GET call for each file. I guess the problem here is simply the overhead in the HTTP requests. What might be nice, again, is some kind of aggregate call which returns all the data in one go. Not sure how you'd do this with HTTP; I don't know if there is a download equivalent of the multipart form POST? Maybe a special call which retrieves the files' data zipped up?

    This one seems harder to solve.

     

    Well anyway, I just wanted to share my experiences and see if any of the Nirvanix staff have any comments. I'm using node2 (Europe I think) if that helps (though I don't think it is a temporary congestion issue).

    Regards,

    John

  • 12-04-2008 8:50 PM In reply to

    • BarryR
    • Top 10 Contributor
    • Joined on 07-20-2007
    • San Diego
    • Posts 710

    Re: Poor performance for small files

    Hi John,

    jdmwood:
    What would be great would be to have "aggregate" web services. So for example, the ability to call SetMetadata with multiple files/metadata in one go.
     

    We have tried to allow multiple inputs on the web services that we predicted would need them.  You will note that the web service calls with plural names always accept multiple inputs.  A plural SetMetadata is a good idea and one I will add to our list of feature requests.

     

    jdmwood:

    This feels like it shouldn't be the case: the total amount of data I was uploading wasn't very large (less than a megabyte), and the number of files wasn't very large either (50 files). It seems like  there is some per-file overhead at the server end, not related to the actual data size.

    Some good news is that this overhead seems to disappear if your file sizes are slightly larger. Also, I guess for most people this just won't be an issue (depending on the size of the files they want to upload).

    I've got a spreadsheet of some experiments if anyone at Nirvanix wants to take a look.



    There is going to be some overhead as we validate each file.  I don't know that we have measured this on a larger scale.  We will recreate this issue and see if there is some optimization that can be done to improve it.  I would like to see your results if possible so we can duplicate your tests exactly.  You can just send me a private message here on the forums or email customersupport@nirvanix.com with the file (zipped, please).

    I would like to bring up our ability to ingest files sent to us on a USB hard drive as well.  If you are in a contract with us and have a large number of files to upload, this is the best way to get them to us.  Essentially we take the hard drive, spider the contents and add them to the storage directly at the node of your choice.  This simplifies the process of moving files into our system when it's the initial push of data outside of ongoing daily operations, and it finishes much faster.

     

    jdmwood:
    HTTP downloading
    This is also not so performant: we have to make a single web service call for each file, and then a single HTTP GET call for each file. I guess the problem here is simply the overhead in the HTTP requests. What might be nice, again, is some kind of aggregate call which returns all the data in one go. Not sure how you'd do this with HTTP; I don't know if there is a download equivalent of the multipart form POST? Maybe a special call which retrieves the files' data zipped up?

    The HTTP protocol does allow for a multi-part response to be returned, but I don't expect this to be coming very soon.  We have had lengthy discussions on implementing a Zip / Unzip (tar, gzip) call that would take in an array of paths on our system and return the new compressed file, and the opposite as well, where we would extract a compressed file onto our system.  We are still some distance from this solution, mostly due to the problem of scalability.  For now, there is no simple answer other than compressing the files on the client and then uploading / downloading the single compressed file, just to reduce the number of files you are transferring.
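
    As a rough illustration of that client-side workaround, the sketch below bundles many small files into one zip archive in memory and unpacks it again, using only the Python standard library. The file names are made up, and how the resulting archive is uploaded or downloaded is left out.

```python
import io
import zipfile


def bundle(files):
    """Pack {filename: bytes} into a single in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def unbundle(blob):
    """Unpack an archive produced by bundle() back into {filename: bytes}."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


# Fifty small files become one upload and, later, one download.
small_files = {f"delta_{i:03d}.bin": bytes([i]) * 1024 for i in range(50)}
blob = bundle(small_files)
restored = unbundle(blob)
```

    The trade-off is that a single file can no longer be fetched without downloading the whole archive it lives in.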

     I appreciate the time you have spent giving this review.  I will attempt to address all of your issues through optimization or by making sure the correct feature requests get submitted.  Feel free to contact Cory for more information on ingestion.

    Best Regards,
                Barry R.

    IM Support (Feel free to add me)

    MSN: barryruffner@live.com
    Gmail: barryruffner@gmail.com
  • 12-05-2008 10:06 AM In reply to

    Re: Poor performance for small files

    Hi Barry,

    Thanks for the excellent response.

    I've emailed the test results.

    In retrospect, having looked over my notes, I suspect this issue really only affects very small files (<32kb). So, I hope I didn't unduly worry other users out there: I strongly suspect this isn't an issue for most applications.

    I think the solutions you have suggested (plural SetMetadata, archived uploads/downloads) are all great. I've subscribed to the Nirvanix RSS feed, so I'll look forward to seeing if/when these features get implemented.

    I've also had some success with running these uploads in a multithreaded manner: when I detect that we need lots of uploads of small files, I split these into several concurrent uploads on different threads. This seems to greatly improve the aggregate performance. I suspect this probably isn't something you want to encourage for most users, so if you do find an obvious performance bottleneck in smaller file uploads, I'd love to hear about it.
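
    In case it helps anyone else, here is a minimal sketch of the idea rather than our actual code. `upload_batch` is a hypothetical stand-in for one multipart POST (it only sleeps to mimic the server-side pause), and the batch size and worker count are arbitrary assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def upload_batch(batch):
    """Hypothetical stand-in for one multipart POST of a batch of files."""
    time.sleep(0.01 * len(batch))  # mimics a per-file pause on the server
    return len(batch)


def upload_concurrently(files, batch_size=10, workers=4):
    """Upload the batches on several threads so their pauses overlap."""
    batches = chunked(files, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(upload_batch, batches))


uploaded = upload_concurrently([f"file{i}.delta" for i in range(50)])
```

    Threads suit this case because the time is spent waiting on the network rather than on the CPU, so the waits of different batches can overlap.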

     All the best, and thanks for your help.

     John

     

  • 12-05-2008 5:29 PM In reply to

    • BarryR

    Re: Poor performance for small files

     I received your test results and will try to replicate them locally.  We always appreciate critiques of our system; they allow us to see use cases that we don't immediately think of when designing our tests.  Everyone here is striving to make our service the best it can be, and feedback like yours helps significantly.

    Thanks for the help,
           Barry R.
