Additional Clarity on Size Field #101

seanherron · 2013-07-31T04:28:59Z

The size field, which is intended to represent the file size of the resource, has many potential areas for improvement.

First off, we should either decide on a standard unit of measurement (like the dct:bytesize) or break out the unit of measurement from the numeric value (e.g. two fields: size for the numeric value and sizeUnit for the unit of measurement). This would allow for machine reading of the value, enabling users to sort and filter by the size of the resource, and would reduce confusion when multiple standards of measurement are used.
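
To illustrate the machine-reading point, here is a minimal Python sketch of the two-field option. The field names `size` and `sizeUnit` come from the proposal above; the sample records, the `UNIT_FACTORS` table, and the choice of decimal (power-of-1000) multipliers are assumptions made only for illustration, not part of the schema.

```python
# Hypothetical catalog records using the proposed two-field layout.
records = [
    {"title": "A", "size": 40, "sizeUnit": "MB"},
    {"title": "B", "size": 512, "sizeUnit": "KB"},
    {"title": "C", "size": 2, "sizeUnit": "GB"},
]

# Assumed decimal multipliers; a real schema would have to pin these down.
UNIT_FACTORS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}


def size_in_bytes(record):
    """Normalize a (size, sizeUnit) pair to a plain byte count."""
    return record["size"] * UNIT_FACTORS[record["sizeUnit"]]


# Sorting and filtering become straightforward once values are normalized.
smallest_first = sorted(records, key=size_in_bytes)
titles = [r["title"] for r in smallest_first]
under_100_mb = [r["title"] for r in records if size_in_bytes(r) < 100 * 10**6]
```

If the schema mandated bytes directly, the normalization step would disappear and `size` could be compared as-is.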

Secondly, what is the rationale behind the cardinality enabling multiple values? Is this in relation to the possibility of having multiple accessURLs? If so, how do we draw a link between a specific accessURL and its size? We could specify that they be represented in order? I'm not a huge fan of that approach but I can't think of a better way to do it.
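
The difference between the order-based approach and one possible alternative can be sketched in Python. The `distribution` grouping, the example URLs, and the helper `size_for` are hypothetical and are not taken from the schema; they only show how a size could be tied to a specific accessURL without relying on position.

```python
# Option 1: parallel lists, where a size is matched to a URL only by position.
# Reordering or omitting one value silently breaks the pairing.
parallel = {
    "accessURL": ["https://example.gov/data.csv", "https://example.gov/data.zip"],
    "size": ["40 MB", "12 MB"],
}
pairs_by_order = list(zip(parallel["accessURL"], parallel["size"]))

# Option 2 (hypothetical): group each URL with its own size in one object.
distribution = [
    {"accessURL": "https://example.gov/data.csv", "size": "40 MB"},
    {"accessURL": "https://example.gov/data.zip", "size": "12 MB"},
]


def size_for(url, entries):
    """Look up the size recorded alongside a given accessURL, or None."""
    for entry in entries:
        if entry["accessURL"] == url:
            return entry.get("size")
    return None
```

With the grouped form, an entry can also omit `size` entirely (for example, for an API endpoint) without shifting the other values.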

Finally, I think this should be renamed either bytesize (if it is represented in bytes) or filesize (if represented in other units of measurement). The rationale for this is that size can be interpreted to mean a variety of things (e.g. size of geographic area covered, number of rows of data, etc.). Either bytesize or filesize would make the meaning clear.

mhogeweg · 2013-07-31T04:41:28Z

It's unclear to me what the purpose of the size field is. Especially when working with APIs and web services, 'size' depends on the specific request for (a subset of) the data and the format the data is returned in. Is it the size of a zipped file that's made available or the unzipped data? What is the size of the Landsat archive (millions of images collected over several decades) vs a picture of NDVI generated from this archive (on-the-fly as part of the web service request) for a small portion of the US?

seanherron · 2013-07-31T05:01:30Z

Good point - I thought about this myself. Government still distributes tons of data via raw file, probably way more often than via APIs or web services. In many circumstances, raw file access is probably the best way to do this, and accessURL (which size is linked to) is inclusive of direct download to raw files.

When accessURL is linked to a raw file, I would say that showing the size of the file is a good practice. As an extreme example, suppose we linked to a gigantic zip file of the entire Landsat archive (or, more realistically, something like FDA SPL data, which is distributed in CSV). People should know the size of the file before they click on it, in particular if someone on a mobile connection wants to quickly check out some tabular data but doesn't realize the file is actually 300,000 records in a 40 MB file or something.

As a side note, this is only useful if size changes as the file itself changes, which would necessitate either human intervention or server-side automation by agencies to update on a regular basis.
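
As a rough idea of what the server-side automation could look like, here is a minimal Python sketch that recomputes the size from the file on disk. The function name `refresh_size` and the shape of the catalog entry are assumptions for illustration; `os.path.getsize` is the standard-library call that returns a file's size in bytes.

```python
import os


def refresh_size(entry, path):
    """Return a copy of a catalog entry with `size` set to the file's current byte count.

    `entry` is a hypothetical metadata record (a plain dict); `path` is the
    local file that the entry's accessURL is assumed to serve.
    """
    updated = dict(entry)
    updated["size"] = os.path.getsize(path)
    return updated
```

Run from a scheduled job or a publish hook, something like this would keep the value in step with the file without anyone editing it by hand.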

mhogeweg · 2013-07-31T05:27:38Z

Just your last point would make me concerned about relying on the currency/accuracy of the size attribute whenever I see it. People just don't (manually) update this type of metadata. This is speaking from over 10 years working with Geospatial One-Stop and Data.gov in the US and various National Spatial Data Infrastructures globally.

On the web, whenever I click a link to download a file, my browser tells me how many bytes I'm about to download. That's directly associated with the actual file/stream/thing I'm about to download. Isn't that enough information for someone to decide to continue or not?
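
For reference, the number a browser shows typically comes from the HTTP `Content-Length` response header, which a client can also read with a HEAD request before downloading anything. The sketch below uses Python's standard `urllib.request`; the helper names `content_length` and `remote_size` are made up for illustration, and a server may omit the header (for example, for streamed or generated responses), in which case the size is unknown.

```python
import urllib.request


def content_length(headers):
    """Return Content-Length as an int, or None if absent or not a plain number."""
    for name, value in headers.items():
        if name.lower() == "content-length":
            value = value.strip()
            return int(value) if value.isdigit() else None
    return None


def remote_size(url):
    """Send a HEAD request and read the advertised size without fetching the body."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return content_length(response.headers)
```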

You describe a use case where someone is on a mobile device wanting to get some data. Do you know if there's an activity related to Data.gov to collect/define/design the various use cases? What IS the expected use of Data.gov in that sense? Are there apps (mobile/web/desktop/...) that people are building using datasets/services found at Data.gov that would then be used for the things you describe? Would those apps be findable at Data.gov?

seanherron · 2013-07-31T05:41:44Z

I agree with your point that this is not something we can reasonably expect that people will manually update, hence my point about it (hopefully) being automated.

I'm not aware of an activity for data.gov to collect use cases, though I believe http://next.data.gov/ is hoping to achieve that to some extent.

Hopefully one of the authors of the schema can chime in here on why they felt size should be included. I'm with you on a lot of your points, and I admit that mobile downloads of data are a pretty edge use case, and most other use cases I can think of (bandwidth-constrained, bandwidth-capped, or storage-constrained environments) would be negated by the fact that we don't really have a way of ensuring this value is correct in the first place.

MarionRoyal · 2013-07-31T13:08:07Z

The field SIZE was used in the standard Data.gov Metadata template in the
manner that you have presumed and was probably just carried over into this
schema. Originally, it was to provide the user an idea of the amount of
resources needed before making a choice to download a block of data (disk
space, time, ...). I could probably argue that this is good to know before
my browser informs me. It could be checked on a mobile app, before taking
some action. We (at data.gov) have never used "size" as a metric of our
progress in achieving open data and I don't believe it is a valid metric
going forward. Points well made on not being applicable to APIs and web
services. So "size" probably rightfully deserves to carry on in the
Required if Applicable section. However, it will be applicable to the vast
majority of records.

With regards to changing the name of the field: As I age, I am becoming
less concerned or at least ambivalent on the nouns chosen to express a
concept (object) as long as the word is easily understood within a context
(or namespace if you will) and mappable to others. I am confident that
"size" in the context of this schema will not be confused with "dimension".
Having said that, it would probably be an improvement to recognize
DCAT:byteSize in future revisions. That, of course, unless we invent a new
noun to represent mass on a storage device.

Marion A. Royal PMP
Program Director, DataGov
GSA Office of Citizen Services and Innovative Technologies
202.302.4634

seanherron · 2013-07-31T15:19:29Z

@MarionRoyal: Thanks for the background. In regards to converting size to recognize bytesize, I'm imagining that the schema ought to mandate that values be given in bytes rather than just allowing for byte values, otherwise we still have the issues I brought up in the original post, right?

MarionRoyal · 2013-07-31T19:26:17Z

@seanherron: If you are asking me if I think we also need a sizeUnit, I
would say no. I think the existing field is a text field rather than a
decimal field - which means that a valid entry could include the number of
bytes (if less than a kilobyte) or could include a set of alphanumeric
characters which would most likely include letters K, M, G, T, P and could
easily be grokked by an app (and maybe even a human). The problem with
having a sizeUnit for this purpose is that it would suggest a need for a
controlled vocabulary for this new field, which I think we are trying to
avoid.
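
As a rough illustration of how an app might grok such a text value, here is a minimal Python sketch. The accepted format (a number, an optional K/M/G/T/P letter, an optional trailing "B") and the decimal multipliers are assumptions; the schema does not define either, and free-text values that do not fit the pattern simply come back as `None`.

```python
import re

# Assumed decimal multipliers for the suffix letters mentioned above.
SUFFIXES = {"": 1, "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15}

# A number, an optional K/M/G/T/P letter, and an optional trailing "B".
PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([KMGTP]?)B?\s*$", re.IGNORECASE)


def parse_size(text):
    """Convert a value such as '40MB' or '512' to bytes, or None if unparseable."""
    match = PATTERN.match(text)
    if match is None:
        return None
    number, suffix = match.groups()
    return int(float(number) * SUFFIXES[suffix.upper()])
```

Anything a publisher writes outside that pattern ("about 40 megs", "large") would need either a looser parser or a human reader.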

So, I would agree with changing the field name to byteSize (since it
matches DCAT) and would have no objection to fileSize (since it is a
recognized PHP term), but would leave sizeUnit to other more precise
domains.

MarinaNitze · 2013-08-01T02:46:09Z

I like @MarionRoyal's idea to adopt DCAT:byteSize -- but flag that not everyone knows what a byte is, so we should link to some sort of basic calculator folks can use to convert from more-familiar KB/MB/GB.
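
One thing such a calculator would have to settle is which definition of KB/MB/GB it uses, since decimal (powers of 1000) and binary (powers of 1024) conventions give different byte counts. The short Python sketch below shows the gap; the function name `to_bytes` and the `binary` flag are illustrative assumptions, not part of any proposal in this thread.

```python
# Decimal units (powers of 1000) versus binary units (powers of 1024).
DECIMAL = {"KB": 1000, "MB": 1000**2, "GB": 1000**3}
BINARY = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}


def to_bytes(value, unit, binary=False):
    """Convert a familiar unit to bytes under the chosen convention."""
    factors = BINARY if binary else DECIMAL
    return value * factors[unit]


decimal_40_mb = to_bytes(40, "MB")              # 40 * 1000 * 1000
binary_40_mb = to_bytes(40, "MB", binary=True)  # 40 * 1024 * 1024
```

For a 40 MB file the two conventions differ by roughly 1.9 million bytes, so any linked calculator should state which one it applies.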

This topic has come up a lot. Ultimately, the size field is not a deeply reliable measure if we are asking people to populate it by hand, because file size changes if so much as a punctuation mark is edited in the source file, and is largely meaningless when applied to APIs, as outlined above. I think those of us who are more technical appreciate this, but we could stand to be clearer to the less-technical folks that they should not be using this field for any sort of precise measurement or for compliance purposes.

Since it's not precise, I am less inclined to make it fully machine-readable. Separate size and sizeUnit fields seem like overkill, because if you're machine-reading you can probably also automatically calculate files' true sizes.

skybristol · 2013-08-06T17:35:55Z

I think the only thing that scales at the relatively crude level of discovery metadata currently being discussed is to do as @MarionRoyal suggests and leave it as a rough textual notification to downstream users. Best practice would be to include some type of units or explanation in the attribute so a human reading it might have a clue on what they are getting into. Otherwise, we'd need to look across various standards on how the magnitude of a given asset might be described and account for all the specifics.

gbinal · 2013-08-06T18:41:55Z

+1 for keeping this more textual and the use of letters K, M, G, T, P. I'm envisioning the spectrum of catalog creators and think that the low bar is appropriate here. I also don't think there'll be many use cases for machine-consumption of this field.

If so, wouldn't it then be best to stick with filesize so as to avoid the need for everyone to go to a filesize catalog each time?

seanherron · 2013-08-07T16:46:41Z

It seems like we're all in agreement that the field isn't particularly useful or relevant, so I'm going to go against my original idea and say we just leave it as is to avoid complication. Maybe in the future, if we look to pare down the schema, this would be a good field to deprecate.

jpmckinney · 2013-08-07T18:11:36Z

Is this a duplicate of #55?

seanherron · 2013-08-07T19:13:52Z

Yes, looks like it. I can close this and reference 55 if you'd like. Didn't come across it when I was posting.

jpmckinney · 2013-08-07T19:16:54Z

@seanherron I've only skimmed the discussion in this thread, but that makes sense!

seanherron closed this as completed Aug 7, 2013

skybristol mentioned this issue Aug 7, 2013

Use file size in bytes, which is less error-prone and conforms to DCAT #55

Closed
